ABSTRACT


Recent advances in reinforcement learning (RL) have propelled the idea that 
artificially intelligent agents may one day replace humans in performing complex tasks. 
There are numerous challenges associated with moving RL from a simulated
environment to the real world. In particular, understanding the decision-making process 
of the RL agents and ascertaining the viability of use in safety-constrained environments 
are key challenges. An evolving approach to addressing these challenges is to impart 
human knowledge into the learning algorithms. Through a comprehensive evaluation 
using a Pong RL agent, this thesis provides evidence that incorporating human influence 
into an RL algorithm can cause a strategy conflict and impede learning. In particular, it 
shows that (i) there is an inflection point, measured in training episodes, in
the positive effect of incorporating human influence for the Pong agent, and that (ii) if
human influence is not decayed beyond the inflection point, the negative effect can
intensify and eventually undo all prior training gains.
CHAPTER 1:


Introduction 


Reinforcement learning (RL) is the process of teaching an agent to perform a task by allowing
it to explore different strategies. The agent adjusts its strategy to reinforce behaviors
that are working well. In recent years, RL algorithms have proven their ability to train
artificially intelligent agents that can outperform humans [1]-[3]. There are numerous challenges
associated with moving RL from simulations to real-world applications [4]. In particular,
understanding the decision-making process of RL agents and ascertaining the viability of
use in safety-constrained environments are key challenges. An evolving approach to addressing
these challenges is to impart human knowledge into learning algorithms. Approaches
to human knowledge integration include automated or real-time human feedback [5], [6],
shaping the reward system to encourage human-like responses to environmental change [7],
providing human demonstrations [8], and policy shaping [6] that encourages the development
of specific strategies. Although research has shown these methods can be used to
prevent undesirable actions and improve performance, there has been no prior work to analyze
the strategies developed by agents trained under human advice, or to consider adverse
consequences. This thesis motivates the importance of comparing agent policies trained
using human influence with the expected behavior being imparted and understanding the
role of human trainer input in shaping the overall policy learned.


1.1 Problem Statement 


The use of humans in RL as either a feedback provider or trainer has not been studied from 
an explainability or interpretability perspective. We suspect that the use of human trainers 
in RL may be an impediment to learning and that agents do not develop the strategies being 
encouraged by human trainers. Furthermore, the underlying policies that represent agent 
strategies are often ignored. This thesis takes a holistic approach to studying RL, examining
not only performance but also policy development and strategy.


More specifically, we examine the resultant agents from Walton’s work integrating heuristics
into Pong RL agents [9]. Walton’s agents showed an interesting performance behavior where
there was rapid improvement early in training, but performance declined later on to the
point that the agent not only forgot all it had learned earlier but also had difficulty learning
again. Walton looked only at performance and did not evaluate whether the human trainer
was effective at teaching the desired strategy. He also did not offer any explanation for
why performance collapsed. We believe the study of Walton’s agents from an
explainability and interpretability perspective is compelling and will provide new insights
into RL algorithms that incorporate human advice.


1.2 Research Questions 

• How does using a human trainer to bias action selection affect the strategy development
of RL agents? Can an agent be encouraged to learn a specific type of strategy
provided by a human trainer? Is the influence from a human an impediment to the
agent learning an optimal strategy? What are the repercussions? How can the strategy
adopted be validated?

• What is the role of the policy network? What effects do hyperparameters have on
performance and strategy development? What information can be derived from the
policy network to understand the decision-making process of RL agents?

• Is there a way to formally show that human trainers adversely affect learning? We
seek to assess the risk that an adversary may hijack human trainer input to attack an
RL system.


1.3 Organization 

This thesis is organized into seven chapters. Chapter 2 includes background information
on RL, the use of human knowledge and feedback to accelerate learning, and methods
of attacking agents. Formal definitions are provided when appropriate and many of the
concepts introduced will be revisited throughout this thesis. Chapter 3 details related works.
This thesis relies exclusively on the OpenAI framework for testing RL algorithms. The
policy network and human trainer algorithm are described in detail. Chapter 4 outlines the
methodology used to answer the research questions. This thesis was conducted in three parts.
First, the source code was redesigned to allow for rapid prototyping and reproducibility.
Second, four categories of experiments were conducted, each requiring multiple
agents to be trained and evaluated. Third, the experiment results were evaluated using a
combination of methods, including reward curves and saliency maps. Chapter 5 describes
each experiment in detail and summarizes results. Chapter 6 describes a new human trainer
introduced to validate additional hypotheses formed from Chapter 5. Lastly, Chapter 7
discusses the limitations of the experiments and provides recommendations for future work.
CHAPTER 2:
Background 


This chapter contains an overview of three core concepts that provide the foundation for 
this thesis. First, RL is discussed alongside its mathematical foundation. Second, methods 
of integrating human knowledge into RL algorithms are explored. Lastly, a discussion of
agent vulnerabilities and adversarial attacks on agents is provided.


2.1 Reinforcement Learning 

RL addresses the problem in which an agent interacts with an environment and seeks to
maximize some type of cumulative reward. The actions are determined by a policy, and the agent
receives feedback for its actions through a reward signal. RL draws on many fields of
study, including computer science, engineering, mathematics, and machine learning.


This section begins by introducing the language and notation used in RL. The formal
mathematical definitions and derivations were obtained from [10], unless otherwise indicated.
Next, a short description of the types of RL algorithms is provided. Last, policy optimization
is analyzed in detail.


2.1.1 Definitions 

The agent is an artificially intelligent entity trained to perform some set of tasks for a
specific environment. An agent can be a player in a game, a robot taught to walk, a helicopter
capable of autonomous transport, or a stock market trader. In general, agents are limited to
the environment and action constraints they are trained under. That is, a great chess-playing
agent has no understanding of how to play checkers. Still, RL algorithms have been effective
at training agents to perform very well in games and other specialized tasks.


The environment is the world in which the agent lives. The more complex the world, the more
difficult it becomes to train effective agents. For a robot vacuum cleaner, the environment
is the floor space it must learn to clean. The vacuum’s world is simpler than that of a
sophisticated stock market trading agent that has to track thousands of stock symbols,
interpret news and reports, forecast earnings, gauge social media sentiment, and incorporate
the geopolitical climate. The state representation of the environment is a snapshot in time
of the agent’s world. The state could be a set of sensor readings, a pixel map received from
a camera, or some other type of input data the agent learns from.


Agents are limited to some action space that contains the set of things an agent is allowed to
do in the environment. A robot vacuum cleaner might have a discrete action space containing
eight cardinal directions, or a continuous action space that allows it to manipulate actuators
with varying degrees of rotation and speed. Recent work from DeepMind showed agents
can learn to act intelligently even in a large discrete action space [3]. More specifically,
DeepMind’s agent learned to play StarCraft II, a complex game where players must control
hundreds of entities (buildings, military units, etc.) at once. The number of possible actions
in StarCraft II is the combination of possible manipulations to each entity, which was found
to be approximately $10^{26}$ actions at every time step.


A reward is typically a scalar value that provides an agent with feedback after it performs
an action. If an agent is doing something good, it is rewarded with a high positive value that
tells the agent to keep doing that good thing. Human system developers are
the architects of the reward signals agents use to learn. A faulty reward signal may lead to
unintended behavior. For example, an agent may find a loophole to cheat in a game because
the reward signal encouraged it to do so. Rewards are often discounted by some
scalar value that determines the value of actions based on rewards to be gained in the
future. Discounted rewards are needed to help the agent determine whether actions taken
throughout the game were good or bad. For example, consider the game of chess with a +1
reward signal for winning a game, -1 for losing a game, and 0 otherwise. If the agent wins
a game, the +1 reward will be discounted such that every move taken that resulted in the
win receives some positive value. The first move will be rewarded less than the second, the
second less than the third, and the last move will receive the highest reward.


The policy is a map that links the state of the environment to actions. In a deterministic
policy, each state maps to a single fixed action. Alternatively, a stochastic policy assigns a
probability to each action given a state. A stochastic policy encourages agents to explore
the unknown and can lead to interesting strategy development.


An episode is a sequence of alternating states, coming from the environment, and actions,
taken in the environment. In a game scenario, an episode might be all of the states and
actions that occurred before the game ended.


Markov Decision Process (MDP): Assume the agent takes actions at discrete time steps,
$t = 0, 1, 2, \ldots, \infty$. At any given time, the environment will be in a Markov state, $s_t$, if
$P[s_{t+1} \mid s_t] = P[s_{t+1} \mid s_1, \ldots, s_t]$. If the state is the same for both the agent and environment
for all time steps, the environment is said to be fully observable and the system an MDP [11].
An MDP is a 5-tuple $(S, A, P, R, \gamma)$, where

• $S$ is a set of possible states.
• $A$ is a set of possible actions.
• $P : S \times A \to \mathcal{P}(S)$ is the state-transition probability function, which gives the
probability of transitioning from one state to another given an action.
• $R : S \times A \times S \to \mathbb{R}$ is the reward function that produces a scalar reward signal given
the current state, current action, and future state.
• $\gamma$ is the discount factor.


Figure 2.1 provides a general overview of the RL problem. The agent receives an initial state
from the environment, $s_0 \in S$, and takes an action, $a_0 \in A$, dictated by the agent’s policy,
$\pi$. This is denoted as $\pi(a \mid s)$. At $t = 1$, the agent is in state $s_1$ and receives a reward, $r_1$,
for the previous action taken. This cycle continues for a specified number of time steps, or
until the process terminates. The objective is to find an optimal policy, $\pi^*$, that maximizes
the discounted return over many episodes.


Figure 2.1. Reinforcement Learning Process


The discounted return is given as

$$G_t = \sum_{k=0}^{T} \gamma^k r_{t+k+1}$$

where $\gamma \in (0, 1)$ is a parameter that determines the value of actions based on rewards to
be gained in the future, and $T$ is the number of time steps in the episode. Intuitively, the
discounted return includes reward signals at all time steps, but considers rewards received
far in the future as being less valuable than rewards received immediately. Consider the
chess example from earlier where the agent receives a reward of +1 for winning a game, -1
for losing a game, and 0 otherwise. Table 2.1 shows the discounted reward when $\gamma = 0.99$
compared to $\gamma = 0.01$.


t                    0        1        2        3      4    5
r_t                  0        0        0        0      0    1
G_t (γ = 0.99)       0.9605   0.9702   0.9801   0.99   1    Game over

Table 2.1. Comparing Discounted Rewards


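Note that $G_t$ gives the discounted return at a specific time step. For example, we calculate
$G_0$ for $\gamma = 0.99$ as

$$\begin{aligned}
G_0 &= \sum_{k=0}^{5} 0.99^k r_{0+k+1} \\
    &= 0.99^0 r_1 + 0.99^1 r_2 + 0.99^2 r_3 + 0.99^3 r_4 + 0.99^4 r_5 \\
    &= 0 + 0 + 0 + 0 + 0.9605 = 0.9605,
\end{aligned}$$

where the $k = 5$ term vanishes because no reward is received after the game ends.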


For $\gamma$ close to 1, the actions taken earlier in the game are considered to be almost as important
as actions taken later. This makes sense for chess, where even the first move taken in a game
can have long-term effects on the strategy. When $\gamma$ is small, the agent will favor immediate
rewards. With $\gamma = 0.01$, the action taken at $t = 0$ receives a very small return, which tells the agent
the move did not have a significant impact on the overall win. A small $\gamma$ would not make
sense for a chess agent. In fact, most RL algorithms use a discount factor close to 1.
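The chess example can be checked numerically. The following is a minimal sketch, assuming the five rewards $r_1, \ldots, r_5$ from Table 2.1 are stored in a list in the order they are received; the function name `discounted_returns` is chosen here for illustration only.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] where rewards[k] holds r_{k+1}.

    Works backwards through the episode so each return reuses the one after it:
    G_t = r_{t+1} + gamma * G_{t+1}.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Chess example: a single +1 reward arrives with the winning move.
chess_rewards = [0, 0, 0, 0, 1]
high_gamma = discounted_returns(chess_rewards, 0.99)
low_gamma = discounted_returns(chess_rewards, 0.01)

for t, (g_high, g_low) in enumerate(zip(high_gamma, low_gamma)):
    print(f"t={t}  G_t(0.99)={g_high:.4f}  G_t(0.01)={g_low:.2e}")
```

With the smaller discount factor, the return credited to the opening move is several orders of magnitude smaller than the return credited to the final move.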


Assume an episode begins with starting state $s_0$ and ends at time step $T$. The probability of
some sequence $\tau = (s_0, a_0, s_1, a_1, \ldots, a_{T-1}, s_T)$ occurring for policy $\pi$ is expressed as

$$P(\tau \mid \pi, T) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)$$

where $\rho_0(s_0)$ is the probability of starting state $s_0$ based on the start-state distribution, and
$P$ is the state-transition probability function. The optimal policy maximizes the expected
reward

$$\pi^* = \arg\max_{\pi} \int_{\tau} P(\tau \mid \pi, T)\, R(\tau).$$

$R(\tau)$ is a function that takes the sequence $\tau$ and computes its total reward. In most
cases, this will be the sum of discounted rewards. Solving for the optimal
policy assumes the state-transition probability function is known, which is often not the
case. Instead, most RL algorithms estimate value functions.


The state-value is the expected return an agent would receive if it were in that state and
followed policy $\pi$ from then on. This is given as

$$V^{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]$$

The state-value function can be decomposed into the Bellman Equation for $v_\pi$ to provide a
more intuitive expression.

$$\begin{aligned}
V^{\pi}(s) &= \mathbb{E}_{\pi}[G_t \mid S_t = s] \\
           &= \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right] \\
           &= \mathbb{E}_{\pi}\left[R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \,\middle|\, S_t = s\right] \\
           &= \sum_{a} \pi(a \mid s) \sum_{s', r} P(s', r \mid s, a)\left[r + \gamma V^{\pi}(s')\right]
\end{aligned}$$

where $s'$ is a possible next-state. The state-value function is the expected sum of the reward
an agent receives moving to $s'$ and the future reward from $s'$. The agent is looking at
all possible next-states, weighing the discounted value of the next-state and the reward
received by the probability of choosing an action that leads there. It is an average over all
the possibilities.


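The expected return given a state and an action is given by the action-value function,

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a].$$

Similarly, the Bellman Equation for $q_\pi$ can be shown to be

$$Q^{\pi}(s, a) = \sum_{s', r} P(s', r \mid s, a)\left[r + \gamma \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s', a')\right]$$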


Solving the RL problem requires finding an optimal policy, $\pi^*$, that maximizes rewards.
An optimal policy requires that $V^{\pi^*}(s) \geq V^{\pi'}(s)$ for all $\pi'$ and all states. There will be at least
one, but possibly multiple, optimal policies that achieve the optimal value function, $V^*$,
and the optimal action-value function, $Q^*$. Thus, the optimal policy is found by maximizing
over $Q^*(s, a)$ or $V^*(s)$. This gives the Bellman Optimality Equations:

$$V^*(s) = \max_{a} \sum_{s', r} P(s', r \mid s, a)\left[r + \gamma V^*(s')\right]$$

$$Q^*(s, a) = \sum_{s', r} P(s', r \mid s, a)\left[r + \gamma \max_{a'} Q^*(s', a')\right]$$

The Bellman Optimality Equations are non-linear and can be solved using dynamic
programming, i.e., through iterative methods, for MDPs with a small number of states. However, as
the state space becomes large, it becomes infeasible to iterate through the set of all possible
states.
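To make the iterative solution concrete, the following is a minimal sketch of value iteration, one common dynamic programming method, applied to a hypothetical two-state MDP. The states, actions, rewards, and the dictionary layout are all invented for illustration and are not taken from the thesis experiments.

```python
# transitions[state][action] = list of (probability, next_state, reward)
# A made-up MDP: "go" from state 0 pays +1 and moves to state 1.
transitions = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 0, 0.0)]},
}


def value_iteration(transitions, gamma=0.9, tolerance=1e-10):
    """Repeatedly apply the Bellman optimality backup until values stop changing."""
    values = {state: 0.0 for state in transitions}
    while True:
        largest_change = 0.0
        for state, actions in transitions.items():
            # max over actions of the expected reward plus discounted next value
            best = max(
                sum(p * (r + gamma * values[nxt]) for p, nxt, r in outcomes)
                for outcomes in actions.values()
            )
            largest_change = max(largest_change, abs(best - values[state]))
            values[state] = best
        if largest_change < tolerance:
            return values


optimal_values = value_iteration(transitions)
print(optimal_values)
```

Each sweep visits every state, which is why this approach stops being practical once the state space grows large.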


2.1.2 Model-Free RL Algorithms 
Many algorithms have been proposed to solve RL problems. An in-depth analysis of each
method is beyond the scope of this thesis. However, providing a theoretical framework for
the various approaches gives a foundation for later analysis.

This thesis is only concerned with model-free algorithms, i.e., the agent does not know the
true model of the environment. Agents learn from experience gained through interaction
with the environment. In contrast, model-based approaches assume a ground-truth transition
probability function and reward function.

Model-free RL algorithms can be divided into two general categories: Policy Optimization
and Q-Learning.


Q-Learning approaches generally approximate the optimal action-value function with some
function $Q_\theta(s, a)$, where $\theta$ represents the parameters of the approximator. The objective is
to find parameters $\theta$ that best represent $Q^*$. A popular method of approximating the value
function is to use a deep neural network. Mnih et al. used a Deep Q-Network, containing
a convolutional neural network, to train agents to play Atari 2600 games with better than
human performance [12].


Policy Optimization 

Policy-based methods encompass a family of algorithms that solve RL problems by finding
parameters, $\theta$, for the best $\pi$. These methods often use gradient ascent on some performance
objective, $J(\pi_\theta)$. It is also common to approximate the value function with some function
$V_\phi(s)$, where $\phi$ is the set of parameters. In the asynchronous advantage actor-critic (A3C)
algorithm, the value estimate is used to update the policy parameters [13].

Since this work is focused on model-free learning using a policy gradient method for
function approximation, we derive the simplest form of the policy gradient following [10],
[14].


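Assume a stochastic, parameterized policy, $\pi_\theta$, and a reward function, $R$. We want to
maximize the expected return when the policy is used to take actions in the environment.
This is given as

$$J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)].$$

The policy is optimized using gradient ascent, where

$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\pi_\theta),$$

$\alpha$ is the learning rate, and $k$ is the update step. The gradient of the expected return is then

$$\begin{aligned}
\nabla_\theta J(\pi_\theta) &= \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] \\
  &= \nabla_\theta \int_{\tau} P(\tau \mid \theta)\, R(\tau) \\
  &= \int_{\tau} \nabla_\theta P(\tau \mid \theta)\, R(\tau) \\
  &= \int_{\tau} P(\tau \mid \theta)\, \frac{\nabla_\theta P(\tau \mid \theta)}{P(\tau \mid \theta)}\, R(\tau) \\
  &= \int_{\tau} P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)\, R(\tau) \\
  &= \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log P(\tau \mid \theta)\, R(\tau)\right]
\end{aligned}$$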

This is an unbiased gradient estimator. If we assume some episode, $\tau_i$, sampled from $P(\tau \mid \theta)$,
then the gradient for that sample becomes

$$g_i = \nabla_\theta \log P(\tau_i \mid \theta)\, R(\tau_i).$$

Intuitively, $R(\tau_i)$ represents how well the policy performed under parameters $\theta$. Applying
the update rule to the parameters increases the log-probability of samples that result in
higher rewards. In practice, gradient updates are averaged over many episodes. The policy
only updates periodically. This prevents the model from overfitting to any one particular
episode.
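The sample gradient and batched update can be illustrated on a toy problem. The sketch below assumes a hypothetical one-step task with two actions, where action 1 earns a reward of 1 and action 0 earns nothing, and a policy with a single parameter $\theta$ that selects action 1 with probability $\mathrm{sigmoid}(\theta)$. None of these choices come from the Pong experiments; they only keep the example small.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_toy_policy(updates=200, batch_size=10, learning_rate=0.5, seed=0):
    """Gradient ascent on J using the sample estimate grad(log pi) * R."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(updates):
        gradient_sum = 0.0
        for _ in range(batch_size):
            p_one = sigmoid(theta)
            action = 1 if rng.random() < p_one else 0
            reward = 1.0 if action == 1 else 0.0      # R(tau) for this one-step episode
            grad_log_prob = action - p_one            # d/dtheta of log pi(action)
            gradient_sum += grad_log_prob * reward
        # Average over the batch, then take one ascent step.
        theta += learning_rate * gradient_sum / batch_size
    return theta


final_theta = train_toy_policy()
print(f"P(action 1) after training: {sigmoid(final_theta):.3f}")
```

Averaging over a batch before each step mirrors the periodic updates described above, so no single episode dominates the change in parameters.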


2.2 Accelerating Learning with a Human Trainer 

The role of humans in RL can extend beyond defining the environment, reward function,
value functions, and learning algorithm. Agent exploration of complex environments can
be very slow, and agents are likely to make poor decisions that even a novice human would
avoid. For example, a person driving a vehicle knows to stop at an intersection if there is
a red light. Agents need to drive through many red lights, get into several accidents, and
receive many large negative rewards before learning to stop. Several methods of integrating
human knowledge have been shown to be effective in RL problems.


One of the earliest applications of using a human trainer in RL was in the field of robotics. 
Dorigo et al. used reward shaping to translate suggestions from a human trainer into 
an "effective control strategy" for the robot agent [15]. Reward shaping is the practice 
of modifying reward signals to guide agent learning. Ng et al. presented a more formal 
framework where the MDP is modified such that the reward function becomes the sum of 
the original reward function and a fixed shaping reward function. Actions can be encouraged 
by choosing a shaping function that gives positive rewards for favorable states. The shaping 
function can be implemented using a fixed set of state-action values, or with a human
observer providing real-time feedback to the agent as in [16].


Another approach is to add action pruning rules that prevent agents from exploring part 
of the search space [17]. The rules use human knowledge of the environment to stop agents 
from taking actions deemed illogical or catastrophic. Action pruning has been widely studied
in the context of artificial intelligence safety [18].


Knox and Stone introduced the Training an Agent Manually via Evaluative Reinforcement
(TAMER) framework that allows a human to train a learning agent using a human reward
function the agent tries to model [19]. TAMER introduces a human trainer model,
$H$, that returns a reward signal given a state and action, $H : S \times A \to \mathbb{R}$. Knox and Stone
provided eight methods of combining TAMER with RL [7]. This thesis is primarily concerned
with using a human trainer to bias the action selection,

$$a' = \arg\max_{a} \left[ Q(s, a) + \gamma \cdot H(s, a) \right]$$

where $\gamma$ is used to decay the human trainer’s influence on action selection, and $Q(s, a)$ is
the Q-function. A variation of this function will be used and is discussed in Section 3.3.
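The biased action selection can be sketched in a few lines. This example assumes $Q$ and $H$ are available as dictionaries keyed by action for the current state, and it uses an exponential schedule to decay the weight; both the data layout and the schedule are illustrative assumptions rather than details from [7] or [19].

```python
def biased_action(q_values, human_values, weight):
    """Return the action maximizing Q(s, a) + weight * H(s, a)."""
    return max(
        q_values,
        key=lambda action: q_values[action] + weight * human_values.get(action, 0.0),
    )


def decayed_weight(initial_weight, decay_rate, step):
    """Hypothetical exponential decay of the trainer's influence."""
    return initial_weight * decay_rate ** step


# Made-up values for a single state with two actions.
q_values = {"up": 0.2, "down": 0.5}
human_values = {"up": 1.0, "down": -1.0}

print(biased_action(q_values, human_values, weight=1.0))  # trainer dominates
print(biased_action(q_values, human_values, weight=0.0))  # pure Q-function choice
```

As the weight shrinks, the selection falls back to the agent's own action-value estimates.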


2.3 Attacking Reinforcement Learning 
RL agents are susceptible to adversarial attacks where the environment state is manipulated
to cause the agent to behave improperly. Figure 2.2 shows how the state of the environment
can be influenced by either an adversarial perturbation or an adversarial policy.


Szegedy et al. demonstrated that crafted perturbations to inputs can cause deep neural networks to
misclassify images [20]. RL algorithms that rely on image pixels from the environment are
vulnerable to similar perturbation attacks, as shown in [21]. Agents performed poorly when
presented with perturbed examples that a human would view as normal. Perturbation attacks
have real-world implications for systems that rely on computer vision for decision-making.
An adversary could modify an object in the physical world in order to confuse the agent
into making dangerous decisions. Lin et al. proposed a method of detecting perturbation
attacks using a prediction model that tries to predict frames based on past observations [22].
When the predicted frame and received frame are drastically different, an alternative action
selection method is used.


The perturbation attacks assume the adversary is able to modify an agent’s observations. 
Gleave et al. showed agents can be attacked by an adversarial policy [23]. The adversarial 
policy represents the behavior of an opponent, or some other actor in the environment, 
that observes the victim’s actions and learns to exploit gaps in their strategy. For example, 
consider an agent trained to play chess against some opponent. There are 20 possible actions
either player can take as their first move. Of those 20 possible actions, there is a subset
of actions considered irregular in the chess community. That is, there are some opening
moves that are not commonly used by human players. Assuming most of the games played
by the agent during training did not include irregular openings, an adversarial policy would
purposefully make irregular moves. The adversary hopes to create a situation the agent is not
familiar with. A non-robust agent might then make poor decisions in response to irregular
behavior. The adversary can exploit this pattern to defeat the agent.
This thesis investigates two questions: first, can a human trainer have a negative impact on an
agent when providing feedback during training; second, can a human who provides genuine
feedback during training also have a negative impact on the agent. Consider the bottom
diagram in Figure 2.2 where the agent’s action is altered based on knowledge provided
by the human trainer. If the agent performs poorly because of the human trainer, the
implementation or human knowledge provided may be bad. However, if the agent performs
well in the early stages of training, but has decaying performance in the long term, then the
human trainer might be considered an adversary. That is, the human trainer prevented the
agent from learning a better strategy despite having provided logical advice.
Top: Adversarial perturbation attacks alter the environment state directly. Middle:
Adversarial policy attacks attempt to create unexpected states the agent is not
trained to handle. Bottom: This thesis investigates whether an adversarial human
trainer can be used to attack agents.

Figure 2.2. Adversarial Attacks on RL
CHAPTER 3:
Related Work 


This thesis leverages a collection of previously published works and open-source code.
These topics will be discussed in further detail.


3.1 Atari 2600 Game Environments and OpenAI 


Bellemare et al. [24] first introduced an interface to Atari 2600 video games that provided 
a framework for research. Mnih et al. used a convolutional neural network to process game 
states represented as pixel frames [1]. The agents learned to play Atari video games using 
the same visual information a human would use. Later, OpenAI developed gym, a toolkit for
testing and comparing the performance of RL algorithms [25]. The OpenAI toolkit includes
Pong, the classic Atari game used in this work.


The gym version of Pong is simple. In the environment, the agent controls a paddle that must
hit a ball towards the opponent. The opponent controls a paddle and will attempt to return
the ball to the agent. If the ball gets past the opponent, the agent scores +1. Similarly, if
the opponent gets the ball past the agent, the opponent scores +1. A game ends when either
the agent or opponent reaches 21 points.


3.2 Karpathy’s Policy Network for Pong 

The policy implementation for the Pong game is largely based on Karpathy’s work [26].
Karpathy’s algorithm uses a neural network to represent the agent’s policy, $\pi(a \mid s; W_1, W_2)$,
where $s$ is a game frame, and $W_1$, $W_2$ are weight matrices for the two-layer neural network.
The policy returns the probability of moving the paddle up. That is, the output node of the
neural network represents the probability of selecting the up action.


Figure 3.1. Pong Agent Policy Network

The input state, $s$, is a length-6400 vector containing the difference between frames from
two consecutive time steps. The environment outputs a 210x160x3 matrix. The matrix is
preprocessed to reduce the computational overhead for the policy network without removing
relevant information such as where the ball and paddle are. This is done through a
combination of cropping out irrelevant pixels and downsampling. Background pixels are
set to zero. Paddle and ball pixels are set to 1. The result is flattened into a length-6400
vector, and the vector from the previous time step is subtracted from it.
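The preprocessing steps can be sketched as follows. This is a plain-Python illustration and not the thesis code: the crop rows (35 to 194), the factor-of-two downsampling, the use of the first color channel, and the background color codes 144 and 109 are assumptions modeled on Karpathy’s publicly available Pong example, and the function names are invented here.

```python
BACKGROUND_VALUES = (144, 109)  # assumed background codes in the first color channel


def preprocess(frame):
    """Convert a 210x160x3 frame (nested sequences) into a flat list of 6400 values."""
    flat = []
    for row in frame[35:195:2]:        # crop the play area, keep every second row
        for pixel in row[::2]:         # keep every second column
            value = pixel[0]           # first color channel only
            is_background = value == 0 or value in BACKGROUND_VALUES
            flat.append(0.0 if is_background else 1.0)
    return flat


def frame_difference(current, previous):
    """Network input: current processed frame minus the previous one."""
    if previous is None:               # first frame of a game has no predecessor
        return [0.0] * len(current)
    return [c - p for c, p in zip(current, previous)]


# Synthetic frame: background everywhere except one bright "ball" pixel.
frame = [[(144, 72, 17) for _ in range(160)] for _ in range(210)]
frame[101][20] = (236, 236, 236)
processed = preprocess(frame)
print(len(processed), sum(processed))
```

Cropping 160 rows and halving both dimensions yields an 80x80 grid, which accounts for the 6400 entries.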


Figure 3.1 depicts the policy architecture. The network output is computed as

$$p_{up} = \mathrm{Sigmoid}\left(\mathrm{ReLU}\left((s_t - s_{t-1}) \cdot W_1\right) \cdot W_2\right)$$

Nonlinearity is introduced in the hidden layer using a rectified linear unit, $\mathrm{ReLU}(x) =
\max(0, x)$, and at the output layer using sigmoid, $\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$. The sigmoid
activation function will produce an output between 0 and 1, taken to be the probability the agent
should move the paddle up.
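A minimal sketch of this forward pass is given below, written in plain Python for self-containment. The weight layout (one row of `w1` per hidden node, one entry of `w2` per hidden node) and the tiny two-input example are assumptions made for illustration; the actual network operates on 6400 inputs.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def policy_forward(state_diff, w1, w2):
    """Return (p_up, hidden activations) for a two-layer policy network."""
    hidden = [relu(sum(w * x for w, x in zip(row, state_diff))) for row in w1]
    logit = sum(w * h for w, h in zip(w2, hidden))
    return sigmoid(logit), hidden


# Tiny made-up example: two inputs, two hidden nodes.
state_diff = [1.0, -1.0]
w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [2.0, 5.0]
p_up, hidden = policy_forward(state_diff, w1, w2)
print(p_up, hidden)
```

The hidden activations are returned alongside the probability because they are needed later when computing weight updates.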


The policy network weights are initialized using the Xavier method [27], where
$W_1 \sim N(0, \frac{1}{n_1})$ and $W_2 \sim N(0, \frac{1}{n_2})$. That is, the weights are sampled from a normal
distribution with mean zero and variance of $\frac{1}{n}$, where $n$ is the number of input nodes for
each layer. The goal is to tune the weights until the neurons activate in such a way that they
can represent a strategy for the agent given different game scenarios.
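The initialization can be sketched as below, assuming zero-mean Gaussian samples with variance $1/n$ for a layer with $n$ input nodes. The function name and the layer sizes in the usage lines are hypothetical and were picked only to keep the example quick to run.

```python
import math
import random


def xavier_normal(n_out, n_in, rng):
    """Sample an n_out x n_in weight matrix with mean 0 and variance 1 / n_in."""
    std = 1.0 / math.sqrt(n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


rng = random.Random(0)
weights = xavier_normal(n_out=50, n_in=400, rng=rng)   # illustrative sizes only

flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
variance = sum((w - mean) ** 2 for w in flat) / len(flat)
print(f"mean={mean:.5f}  variance={variance:.5f}  target={1 / 400:.5f}")
```

Scaling the variance by the number of inputs keeps the pre-activation sums at a similar magnitude regardless of layer width.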
The policy network returns a probability of selecting the up action. The actual move the
agent makes is determined by drawing from a uniform distribution, $U[0, 1]$, and comparing
the sample with the network’s output. If the sample is less than $p_{up}$, the agent will move
the paddle up. Otherwise, the paddle moves down. Note that Karpathy’s algorithm disregards
the no-op action. Sampling the action in this way makes the algorithm stochastic and allows the
agent to explore different strategies, similar to an epsilon-greedy strategy.
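The sampling rule is short enough to show directly. In this sketch the integer action codes for up and down are assumptions (they follow the convention used in Karpathy’s public example), and a seeded generator is used so the run is repeatable.

```python
import random

UP, DOWN = 2, 3   # assumed environment action codes for moving the paddle


def sample_action(p_up, rng):
    """Move up when a uniform draw from [0, 1) falls below p_up, otherwise down."""
    return UP if rng.random() < p_up else DOWN


rng = random.Random(0)
draws = [sample_action(0.7, rng) for _ in range(10000)]
up_fraction = draws.count(UP) / len(draws)
print(f"fraction of up moves: {up_fraction:.3f}")
```

Over many draws the fraction of up moves approaches the network output, while individual moves remain random.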


The policy network parameters are tuned using batch gradient descent with an optimizer.
The agent plays a specified number of games. The input frames, output probabilities, hidden
layer activations, and rewards are buffered for every game played in the batch.


The gradient used to compute the parameter updates during backpropagation depends on
the action selected by the agent and the probability. If the agent moved the paddle up, the
gradient becomes the product of $(1 - p_{up})$ and the normalized discounted reward for the
time step. Similarly, if the agent moved the paddle down, the gradient is the product of
$(-p_{up})$ and the normalized discounted reward for the time step. The discounted reward is
computed for each game using $\gamma = 0.99$. The hidden layer activations and gradient are used
to generate model updates using backpropagation. The updates will encourage the actions
that led to positive rewards and discourage those that led to negative rewards. Updates are
accumulated for a batch and applied using the RMSProp [28] optimizer at the end of a batch
cycle.
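The per-step gradient signal described above can be sketched as follows. The sketch assumes that normalization means subtracting the mean and dividing by the standard deviation of the discounted rewards; that detail, the function names, and the two-step example are illustrative assumptions.

```python
import statistics


def discounted_rewards(rewards, gamma=0.99):
    """Discounted reward at each time step of one game."""
    result = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        result[t] = running
    return result


def normalize(values):
    """Shift to zero mean and scale to unit standard deviation (assumed scheme)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


def output_gradients(moved_up, p_ups, rewards, gamma=0.99):
    """(1 - p_up) * reward for up moves and (-p_up) * reward for down moves."""
    advantages = normalize(discounted_rewards(rewards, gamma))
    return [
        ((1.0 if up else 0.0) - p_up) * advantage
        for up, p_up, advantage in zip(moved_up, p_ups, advantages)
    ]


# Two-step made-up game: up then down, with a +1 reward at the end.
gradients = output_gradients([True, False], [0.6, 0.6], [0.0, 1.0])
print(gradients)
```

These values form the error signal at the output node, which backpropagation then combines with the buffered hidden activations and inputs.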


3.3 Walton’s Human Trainer

Walton’s human trainer will be used to study the behavior of agents influenced by human 
input [9]. Intuitively, Walton’s human trainer teaches a defensive strategy that a human 
player might use. The strategy is to keep the paddle at the same vertical position as the ball. 
If the ball is higher than the paddle, move the paddle up. Move down if the ball is below the 
paddle. Keep the paddle in place if the ball and paddle are aligned. Walton’s implementation
disregards the direction the ball is traveling.


Figure 3.2. Reward Curve for Walton’s Agent at 10K Episodes. Adapted
from [9].

Using the defensive strategy, Walton’s algorithm returns a recommended action and
adjusts the output from the policy network, either increasing the probability of moving up,
decreasing it, or leaving it unchanged. The amount of influence the recommendation has
on the network’s output is dependent on the intensity parameter chosen in the range (0, 1). A
higher value means the human’s recommended strategy will have more influence on action
selection.
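A rough sketch of such a trainer is shown below. The recommendation rule follows the defensive strategy described above, but the way the recommendation is blended into the network output (a linear pull toward 1 or 0 scaled by the intensity) is purely an assumption for illustration; the rule Walton actually used is not reproduced here, and the function names are invented.

```python
def recommend(ball_height, paddle_height):
    """Return +1 (move up), -1 (move down), or 0 (stay) from vertical positions.

    Heights are assumed to increase upward; the ball's direction is ignored.
    """
    if ball_height > paddle_height:
        return 1
    if ball_height < paddle_height:
        return -1
    return 0


def adjust_probability(p_up, recommendation, intensity):
    """Hypothetical blend: pull p_up toward the recommended action by `intensity`."""
    if recommendation == 0:
        return p_up                      # aligned: leave the network output unchanged
    target = 1.0 if recommendation > 0 else 0.0
    return (1.0 - intensity) * p_up + intensity * target


advice = recommend(ball_height=5, paddle_height=3)
print(adjust_probability(0.5, advice, intensity=0.2))
```

Under this assumed blend, an intensity near 1 would let the trainer's advice almost fully determine the action, while an intensity near 0 would leave the policy network in control.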


Walton showed that human logic was able to accelerate learning in the initial training steps
for the Pong game, as shown in Figure 3.2. However, Figure 3.3 shows that the policy using
human logic rules began to produce poorly performing agents as training surpassed 30,000
episodes. This result suggests it may be possible to disrupt an agent’s ability to learn using
rules that give the appearance of improving learning performance at the onset, but result in
anomalous behavior later on.
CHAPTER 4:
Methodology 


The goal of this thesis is to understand the behavior of RL agents trained under human
influence. In particular, we explore how agents may be adversely affected by a human
trainer providing a legitimate strategy, as previously reported in [9]. Karpathy’s policy
network and Walton’s human trainer provide the basis for this thesis.
Figure 4.1. Methodology Overview


Our approach to understanding agent development was conducted in three parts, as shown
in Figure 4.1.


First, the source code from Karpathy and Walton was redesigned to leverage Graphics
Processing Unit (GPU) acceleration, enable reproducibility, and collect new data for
post-training evaluation.


Second, a series of experiments was performed around four focus areas: reproducing
baseline results from Walton’s work, varying hyperparameters, modifying the policy
network, and retraining “bad” weights.


Third, data produced from the experiments was analyzed using several techniques, such as
activation plots and saliency maps.


4.1 Redesign 


The redesign made three important changes to Walton’s version of the source code.


The first change added GPU support. GPUs speed up training and have been shown to
significantly outperform CPUs in deep learning applications. Training one of Walton’s Pong
agents for 100,000 episodes took approximately two weeks using the Naval Postgraduate
School’s High Performance Computing (HPC) system (see A.1 for details). This was too
slow for the scale of experiments we expected to conduct. GPU acceleration allowed
us to train many agents in a reasonable amount of time.


The second change allowed us to have fully reproducible results. In particular, training
that had been interrupted could restart from the last saved model update while still being
reproducible. This was important because agents were trained on shared hardware and jobs
were disrupted often. We also anticipated power outages that would require hardware to be
shut down. Combining reproducibility with retraining allowed us to pick up training where
it left off.


The final set of changes added code to support data collection. In addition to the
model parameters and running reward, which were already periodically recorded, we also
needed to collect the hidden node activations and actions taken by the agent.


4.1.1 GPU Implementation 

Our initial attempt to perform deep learning on GPUs used TensorFlow (TF) [29]. Developed
and open sourced by Google Inc., TF includes a wide array of attractive features useful to
this thesis. TF supports GPU hardware acceleration and automatic differentiation. The
library includes commonly used optimizers for gradient descent, including the RMSProp
optimizer used in Karpathy’s policy network, and models can be saved with metadata,
making retraining simple. However, the computational pipeline is stored in GPU memory
as a session graph. Collecting neuron activations and other intermediate states requires
that the graph be initialized. This made debugging cumbersome because tensors could not be
accessed directly. Initial results from our TF implementation of the policy gradient algorithm
with Walton’s human trainer failed to reproduce agent behaviors seen in Walton’s work. At
the time, we believed the cause was some TF-specific optimization in the automatic
differentiation library. Later, we learned there was a bug in our implementation of the human
trainer, but by then we had abandoned TF in favor of an alternative library, CuPy.


CuPy is an open-source project that provides a NumPy-compatible library for GPU
calculations [30]. CuPy is compatible with NVIDIA GPUs and uses the Compute Unified
Device Architecture (CUDA) parallel computing platform (the same platform used by TF).
NumPy is a library of mathematical functions that support multidimensional tensors. In
most cases, NumPy calls can be replaced with similar CuPy calls. Implementing the
algorithm proved far simpler in CuPy compared to TF. The redesigned implementation
takes a GPU identifier as an argument and checks for device availability. If the device is
available, training begins using the GPU. Of note, CuPy only uses GPU acceleration on
the mathematical operations where all inputs are CuPy tensors. Therefore, CuPy will not
outperform a framework like TF where the entire network pipeline is optimized for GPU
acceleration.
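
The drop-in relationship between the two libraries can be illustrated with a short sketch. The helper names `select_backend` and `policy_forward` are hypothetical and are not part of the thesis code; the sketch assumes only that CuPy mirrors the NumPy calls used here, and it falls back to NumPy when CuPy or the requested device is unavailable.

```python
import numpy as np


def select_backend(gpu_id):
    """Return the cupy module if the requested GPU is usable, else numpy."""
    try:
        import cupy as cp
        if 0 <= gpu_id < cp.cuda.runtime.getDeviceCount():
            cp.cuda.Device(gpu_id).use()
            return cp
    except Exception:
        # CuPy is not installed or no CUDA device is available.
        pass
    return np


def policy_forward(xp, W1, W2, x):
    """Two-layer policy network: ReLU hidden layer, sigmoid output (p_up)."""
    h = xp.maximum(0, W1 @ x)          # hidden layer activations
    logit = W2 @ h
    return 1.0 / (1.0 + xp.exp(-logit)), h


xp = select_backend(0)
W1 = xp.zeros((200, 6400))             # same code path for either backend
W2 = xp.zeros(200)
frame = xp.ones(6400)
p_up, hidden = policy_forward(xp, W1, W2, frame)
```

Because every array is created through `xp`, all inputs to each operation live in the same library, which is the condition the text gives for CuPy to use the GPU.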


Since we periodically write portions of the policy network to disk, it was important to
understand which variables were being stored in GPU memory and which were in system
memory. Accessing data in GPU memory too often can slow down training because data
has to be copied to system memory before being saved to disk. We avoid unnecessary
conversions between NumPy and CuPy when possible. For example, we do not convert the
observation returned by gym to a CuPy tensor until after it has been preprocessed, passed
to Walton’s human trainer for an action recommendation, and used to compute the temporal
difference frame. Thus, we did not have to convert to CuPy those parts of the source code
that do not necessarily need GPU acceleration.


4.1.2 Enabling Reproducibility 
The ability to retrain agents and have those agents be reproducible were two established
criteria for experiments. Accomplishing this required that we identify and set seeds for any
pseudorandom number generators in the system and account for the state of the optimizer.
There are two pseudorandom number generators (PRNGs) in use. First, the CuPy library uses
pseudorandom number generation to initialize the starting weights and draw samples from
uniform and normal distributions. Setting the seed for the library allowed us to reproduce
starting agents. Second, the environment created by gym introduces randomness in the
episode. For example, when Pong starts, the ball can move left or right. The trajectory
of the ball at the start, and the actions of the opposing agent, are also not deterministic
unless the environment is seeded. The gym environment has a method seed() that was set
to control exactly how the ball starts for each episode. If the ball starts the same, the
opposing Atari agent will make the same actions.
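
The effect of seeding the weight initialization can be shown with a small sketch. NumPy's `RandomState` stands in here for the CuPy generator, and the scaling of the initial weights follows Karpathy's published script; both are assumptions for illustration, as is the function name.

```python
import numpy as np


def init_weights(seed, n_inputs=6400, n_hidden=200):
    """Draw initial policy weights from a seeded generator."""
    rng = np.random.RandomState(seed)
    # Assumed scaling: divide by the square root of the fan-in.
    W1 = rng.randn(n_hidden, n_inputs) / np.sqrt(n_inputs)
    W2 = rng.randn(n_hidden) / np.sqrt(n_hidden)
    return W1, W2


a1, a2 = init_weights(seed=1)
b1, b2 = init_weights(seed=1)   # same seed: identical starting agent
c1, c2 = init_weights(seed=2)   # different seed: different starting agent
```

The same idea applies to the environment: seeding it fixes how the ball starts, so two runs that share both seeds see the same sequence of episodes.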


Retraining partially trained agents and having those agents be the same as those that did not
have a pause in training required more than setting the two seeds for the random number
generators. Assume Agent A is trained for 1,000 episodes with a batch size of 100 and seed
of one. Agent B is trained for 2,000 episodes with a batch size of 100 and seed of one. If the
weights of Agent A are loaded, and the agent is trained for 1,000 more episodes, it will not
have the same weights as Agent B. Agent A’s weights were used as the initial weights
for training, but the random number generators for CuPy and gym were not in the same state
as when Agent A ended training the first time.


Algorithm 1: Replay Algorithm

Set seeds;
previous_actions ← load recorded actions;
for episode_actions in previous_actions do
    i ← 0;
    while episode not over do
        action ← episode_actions[i];
        play action in environment;
        sample from uniform distribution;
        i++;
    end
    reset environment;
end


Figure 4.2. Replay Algorithm 
The Replay Algorithm in Figure 4.2 is introduced to enable retraining agents that are also
reproducible. Replay is used to get the state of the random number generators back to where
they were when training ended. The CuPy and gym seeds are set to the same values previously
used. The weights are initialized with CuPy and then overwritten with the weights we want
to retrain from. The actions taken by the Pong agent are loaded and replayed into gym. For
each action taken, we have to sample from a uniform distribution using CuPy. This gets the
CuPy random number generator back to the same state.
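
The idea behind replay can be demonstrated without an Atari environment. In the sketch below, NumPy's `RandomState` stands in for the CuPy generator, `play_episodes` is a hypothetical stand-in for training that draws one uniform sample per action, and the optional `step_fn` callback marks where a recorded action would be played into the environment. None of these names come from the thesis code.

```python
import numpy as np


def play_episodes(rng, n_episodes, steps_per_episode):
    """Stand-in for training: one uniform sample decides each action."""
    recorded = []
    for _ in range(n_episodes):
        actions = [int(rng.uniform() < 0.5) for _ in range(steps_per_episode)]
        recorded.append(actions)
    return recorded


def replay(seed, previous_actions, step_fn=None):
    """Re-seed, then consume one uniform sample per recorded action."""
    rng = np.random.RandomState(seed)
    for episode_actions in previous_actions:
        for action in episode_actions:
            if step_fn is not None:
                step_fn(action)   # play the recorded action in the environment
            rng.uniform()         # advance the generator exactly as training did
        # an environment reset would happen here, between episodes
    return rng


# Uninterrupted run: play three episodes, then look at the next sample.
rng_a = np.random.RandomState(1)
recorded = play_episodes(rng_a, n_episodes=3, steps_per_episode=5)
next_sample_uninterrupted = rng_a.uniform()

# Interrupted run: rebuild the generator state from the recorded actions.
rng_b = replay(1, recorded)
next_sample_after_replay = rng_b.uniform()
```

After replay, the two generators produce the same next sample, which is the property that lets a resumed agent match one that was never paused.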


Agent actions, weights, and the previous state of RMSProp must be collected throughout
training for this algorithm to work. The agent is not trained during replay, and there are no
weight updates until after the episode state is back to where the agent last played. Agents
can only be retrained from the last episode for which data was collected. The data is
collected for analysis regardless of whether agents are retrained, so supporting replay adds
no data collection overhead to training.


4.1.3 Data Collection 

The scope of research required extensive data collection throughout training. Computational
overhead and storage capacity were monitored and optimizations put in place when possible.
In particular, some of the data collected resided in GPU memory and had to be moved to
system memory before being written to disk. In this scenario, the interval between data
dumps to disk was made larger.


Running Reward 

The running reward, r_t, is a measure of the agent’s performance over t episodes. It is a
simple indication of whether an agent is improving its gameplay or getting worse. The
reward sum, r_sum, is a per-episode metric. It is the difference between the number of points
the agent receives and the number of points the opponent receives. In a 21-point episode,
if the agent scores 15 points and Atari scores 21, then the reward sum is -6. The weighted
running reward is calculated as r_t = 0.99 * r_{t-1} + 0.01 * r_sum. Later episodes are considered
more representative of the agent’s overall performance since the policy network has had more
opportunities to train. The running reward is computed for each episode, stored in memory,
and periodically written to disk for future analysis.
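
The running reward update is a one-line exponential moving average. In the sketch below the function name is hypothetical, and starting the average at the first episode's reward sum is an assumption borrowed from Karpathy's published script.

```python
def update_running_reward(running_reward, reward_sum):
    """Apply r_t = 0.99 * r_{t-1} + 0.01 * r_sum."""
    if running_reward is None:
        # Assumption: the first episode initializes the average.
        return float(reward_sum)
    return 0.99 * running_reward + 0.01 * reward_sum


r = update_running_reward(None, -21)   # agent lost 0 to 21
r = update_running_reward(r, -6)       # agent lost 15 to 21
```

With these two episodes the running reward moves from -21 to -20.85, which shows how slowly a single good episode shifts the curve.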
Move Selection

The agent is allowed to move the paddle up or down. The move decision is based on the 
output of the forward pass from the policy network, influence from the human trainer, and 
a sample from a uniform distribution. The actual move the agent makes after going through 
the decision process is recorded. Every move is tracked for all episodes. The moves can be 


analyzed for strategy patterns and are required to execute the Replay Algorithm. 


Optimizer Update 
The model parameters are updated using Karpathy’s implementation of RMSProp [28],
an adaptive learning rate method. The optimizer accelerates convergence by adjusting the
learning rate at runtime. The update is given as:

\[ E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2 \]
\[ w_{t+1} = w_t + \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} g_t \]

where γ is the decay rate, η is the learning rate, ε prevents division by zero, and g is the
gradient. Each gradient is based on a single episode of Pong (a race to +21
points). The mini-batches used to update the weights typically contain ten gradients. The
weights are updated in large increments when the signs of the gradients are in disagreement,
which allows the agent to explore different policies. Once the signs of the gradients become
mostly positive, or negative, the exponentially decaying average of squared gradients gets
larger, and the updates get smaller. That is, the updates become more finely tuned when the
agent discovers a working strategy.
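
A single RMSProp step can be sketched as follows. The function name and the default hyperparameter values are illustrative assumptions, and placing ε outside the square root follows Karpathy's published script rather than anything stated in the text. The sign of the step is positive because the policy gradient is being ascended.

```python
import numpy as np


def rmsprop_step(w, grad, cache, learning_rate=1e-3, decay_rate=0.99, eps=1e-5):
    """Return updated weights and the updated average of squared gradients."""
    # Exponentially decaying average of squared gradients.
    cache = decay_rate * cache + (1.0 - decay_rate) * grad ** 2
    # Step scaled by the root of that average.
    w = w + learning_rate * grad / (np.sqrt(cache) + eps)
    return w, cache


w = np.zeros(3)
cache = np.zeros(3)
grad = np.array([2.0, -2.0, 0.0])
w, cache = rmsprop_step(w, grad, cache)
```

The returned `cache` is the quantity that has to be carried from one update to the next, which is why the optimizer state must be saved for reproducible retraining.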


Computing E[g²]_t requires that we keep track of E[g²]_{t-1}. To allow for retraining, it is
necessary to periodically save the previous update to disk. To limit storage requirements,
only the most recent optimizer update is saved to disk. Retraining will only be able to start
from the last episode trained if reproducible results are required. If results do not require
perfect reproducibility after retraining, then this can be ignored.
Model Weights

A neural network with 6400 inputs, 200 hidden nodes, and a single output node requires
6400 × 200 = 1,280,000 weights for the first layer connecting the input to the hidden nodes,
and 200 weights to connect the hidden layer to the output node. Using double precision
floating point numbers, approximately 10 MB is required to store a single iteration of the
neural network. Given a training batch size of 10 episodes, storing every neural network for
a training run of 100,000 episodes would require 100 GB of storage. The storage overhead
becomes unacceptable as the number of experiments increases. The weights for the neural
network are collected periodically to limit storage requirements. The default period is
2000 episodes. The network will have trained 200 times in that period. Finer data can be
obtained by retraining from a particular episode. The state of the optimizer is only saved
for the last batch trained. Retraining using a network other than the last one will introduce
changes that may not be reproducible if the optimizer state is lost.
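
The storage figures follow from simple arithmetic, sketched below. The function names are hypothetical, and the calculation ignores any file format overhead.

```python
BYTES_PER_DOUBLE = 8


def snapshot_bytes(n_inputs=6400, n_hidden=200):
    """Bytes needed to store one copy of the network weights."""
    n_weights = n_inputs * n_hidden + n_hidden   # first layer + output layer
    return n_weights * BYTES_PER_DOUBLE


def run_storage_gb(total_episodes, save_every):
    """Gigabytes needed when a snapshot is saved every `save_every` episodes."""
    n_snapshots = total_episodes // save_every
    return n_snapshots * snapshot_bytes() / 1e9


per_snapshot_mb = snapshot_bytes() / 1e6                 # about 10 MB
every_batch_gb = run_storage_gb(100_000, save_every=10)  # about 100 GB
periodic_gb = run_storage_gb(100_000, save_every=2000)   # about 0.5 GB
```

Saving every 2000 episodes instead of every batch reduces the storage for a 100,000-episode run by a factor of 200.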


4.2 Experiments 
Four experiments were performed to produce agents we could analyze. For each
experiment, multiple tests were conducted with different seeds.


4.2.1 Experiment 1 - Reproduce Results 

The first experiment established a baseline behavior for agents. Two tests were conducted
to reproduce results from [9] and [26] using a redesigned implementation with additional
data collectors and tunable parameters. The agents trained in Experiment 1 were used as a
benchmark for comparison. We also used the data collected to produce new plots that
provide additional insight into agents, strategies, and policy networks.


4.2.2 Experiment 2 - Vary Hyperparameters 

The second experiment varied hyperparameters to determine how changes in the neural
network’s policy affect Walton’s agent. We suspected that varying the intensity and hidden
node parameters would alter the performance of agents. We also sought to find a pair of
parameters that would cause the running reward to decay earlier in the training cycle. These
parameters were used in future experiments to test hypotheses more quickly.
4.2.3 Experiment 3 - Modifications to the Policy Network

The third experiment trained agents using variations of the policy network. We implemented
a decay factor for the human trainer. As discussed in Section 2.2, previously proposed
algorithms with a human trainer included a decay factor. The idea is to only allow the human
trainer to influence the agent early in training. As the agent performs better, the human
trainer should have less influence on the policy. We also studied the effects of normalization
and dropout on the policy network. These techniques have been used to prevent overfitting
and are discussed further in Section 5.3.


4.2.4 Experiment 4 - Recovering Bad Agents 

The fourth experiment tries to recover agents trained for 100K episodes using Walton’s
human trainer, as illustrated in Figure 3.3. We start a new training cycle using ‘dead’
weights taken from Walton’s human trainer agents, that is, the agents that performed well
early in the training cycle but later lost every episode. We wanted to assess whether the
model parameters retained any learned behavior.


4.3 Evaluation Methodology


Watching an agent interact with the environment is the simplest method we can use to begin 
understanding the strategy learned by the network. However, this approach does not offer 
insight into the features of the environment that are most important to the network, or how 
the neurons are activating. We analyze agent performance using reward curves and try to 


understand strategy development using activation plots and saliency maps. 


4.3.1 Reward Curves 

Mnih et al. used training curves to track the agent’s average score and average predicted 
action-value over episodes [1]. Similarly, we plot the running reward over all training 
episodes. The reward curve is able to adequately represent the agent’s performance in the 


episode, but it does not offer insight into the agent’s strategy or the policy network. 
4.3.2 Hidden Layer Activation Plots

The node activations from the forward pass of the neural network are saved after the neurons 
pass through the ReLU activation function. The intent is to look for vanishing, degrading, or 
exploding activations. The representation of activation information is loosely based on [31] 
where violin plots were used to show activations of hidden layers. Since activations will 


never be below zero, we chose to use boxplots and focus on the upper region. 
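
The statistics behind such a boxplot can be computed directly from the saved activations. The sketch below is a hedged illustration: the function names and the dictionary layout are assumptions, and the plotting step itself is omitted.

```python
import numpy as np


def hidden_activations(W1, frames):
    """ReLU activations, one row per frame and one column per hidden node."""
    return np.maximum(0.0, frames @ W1.T)


def activation_summary(acts):
    """Per-node quartiles and maximum, the quantities a boxplot displays."""
    q1, median, q3 = np.percentile(acts, [25, 50, 75], axis=0)
    return {"q1": q1, "median": median, "q3": q3, "max": acts.max(axis=0)}


rng = np.random.RandomState(0)
W1 = rng.randn(5, 16)              # toy sizes: 5 hidden nodes, 16 inputs
frames = rng.randn(30, 16)         # 30 frames from one hypothetical episode
acts = hidden_activations(W1, frames)
summary = activation_summary(acts)
```

A node whose third quartile sits at zero is silent on most frames, while a very large maximum points to a growing activation; both patterns are what the plots are meant to expose.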


4.3.3 Saliency Maps 

Saliency methods estimate the importance of pixels in images passed through convolutional 
neural networks. Estimates can be compiled into gradient magnitude heatmaps using the 
results of a pixel scoring function [32]. Greydanus et al. introduced a method for generating 
saliency maps that could be used to explain the strategies developed by deep RL agents 
playing Atari 2600 [33]. Gaussian blur was used to occlude each pixel in the input frame, 
similar to Fong’s perturbation method in [34]. Gu and Tresp were able to use saliency maps 


to explain adversarial attacks on networks [35]. 


Saliency maps were generated using a variation of the perturbation-based saliency method
in [33], as illustrated in Figure 4.3. We start with a frame, I, and feed it through the neural
network to produce a probability L. Next, we apply Gaussian blur to pixel coordinate (i, j)
and recompute the probability, l, by feeding the blurred frame through the neural network.
The scoring function S(l, L) = ½ ||L − l||² provides us a measure of importance for the pixel
(i, j). The greater the difference between probabilities, the more impact the pixel had on the
neural network. Scores are computed for many pixels in a frame and mapped to a scoring
matrix.
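
The scoring loop can be sketched with NumPy alone. This is an illustration under stated assumptions, not the thesis implementation: the occlusion blends the frame with a blurred copy using a Gaussian mask centered on the pixel (the approach of the perturbation method in [33]), `policy` is any callable that maps a frame to a probability, and all function names and default parameter values are hypothetical.

```python
import numpy as np


def gaussian_kernel1d(sigma):
    """Normalized 1-D Gaussian kernel truncated at three standard deviations."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return kernel / kernel.sum()


def blur(frame, sigma):
    """Separable Gaussian blur: filter the rows, then the columns."""
    kernel = gaussian_kernel1d(sigma)
    rows = np.apply_along_axis(np.convolve, 1, frame, kernel, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="same")


def occlusion_mask(shape, center, sigma):
    """Gaussian bump with value 1 at `center`, falling toward 0 elsewhere."""
    ys, xs = np.mgrid[:shape[0], :shape[1]]
    dist2 = (ys - center[0]) ** 2 + (xs - center[1]) ** 2
    return np.exp(-0.5 * dist2 / sigma ** 2)


def saliency_scores(frame, policy, sigma=3.0, density=5):
    """Score every `density`-th pixel with S(l, L) = 0.5 * (L - l)**2."""
    L = policy(frame)
    blurred = blur(frame, sigma)
    rows = range(0, frame.shape[0], density)
    cols = range(0, frame.shape[1], density)
    scores = np.zeros((len(rows), len(cols)))
    for a, i in enumerate(rows):
        for b, j in enumerate(cols):
            mask = occlusion_mask(frame.shape, (i, j), sigma)
            perturbed = frame * (1 - mask) + blurred * mask
            l = policy(perturbed)
            scores[a, b] = 0.5 * (L - l) ** 2
    return scores
```

A toy policy that depends on a single pixel shows the intended behavior: the highest score lands on the grid cell covering that pixel, because occluding it changes the output probability the most.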


The scoring matrix is upsampled to 160 x 160 pixels using bilinear interpolation. The
upsampling makes the scoring matrix the same size as the Atari frame. The scores are
normalized and then color mapped to RGB pixel values using the OpenCV JET colormap. The
most important pixels appear blue while the less important pixels appear red. The scoring
matrix represents the saliency map that is projected onto I. The algorithm described was
performed on sequences of frames within an episode and stitched together to produce videos
that could be analyzed.


There are two main differences between our implementation and that of Greydanus et al. First,
there is no requirement to compute maps for the value estimate, since we do not have a critic
agent. Second, images are represented as difference frames between time steps as opposed
to the unprocessed pixels (i.e., frames as a human would see the episode). Saliency maps
are projected onto the original episode frames for easier interpretability.


There are two σ parameters used for the standard deviation of the Gaussian kernel. One
changes the radius of blur for the individual pixels being occluded. The second σ parameter
is used for the Gaussian kernel that is applied to the resultant score matrix after all pixels
have been scored. The density parameter can be adjusted to reduce computational overhead.
A density of 1 means every pixel will be scored. A density greater than 1 will skip pixels
row and column-wise. The implementation blurs the scoring matrix, so skipped pixels will
receive some score based on the pixels around them.
The top leftmost image is the original difference frame that gets input to the
neural network to produce a probability of moving up, L = 0.9. The lower three
frames have been masked with a Gaussian kernel at the designated pixel location.
The three masked frames are fed through the neural network to produce some
probability, l, that is given to the scoring function. The scores are represented as a
matrix with the location corresponding to the location of the occluded pixel. Once
scores have been computed for all 6400 pixel locations (density=1), the Score
Matrix is used as an overlay after being upsampled, normalized, blurred with a
Gaussian kernel, and converted to a color map. High scores will appear as dark
blue and lower scores as dark red.


Figure 4.3. Saliency Map Algorithm Illustrated 
CHAPTER 5:


Experiments and Results 


This chapter is organized by experiment, where each experiment explores a different aspect 
of the RL algorithm. The first experiment establishes baseline results for Karpathy’s and 
Walton’s agents. The second experiment varies training hyperparameters such as the number 
of hidden nodes. The third experiment makes modifications to the policy to try and force 
certain behaviors. The last experiment examines the resultant neural network weights to 


determine if they are recoverable. 


5.1 Reproduce Results 

The first experiment establishes a baseline behavior for agents. We conducted four itera- 
tions of two tests using different seeds. The first test uses the GPU accelerated version of 
Karpathy’s policy network we created. The second test uses our implementation of Walton’s 
human trainer algorithm. Table 5.1 lists the parameters used to train the network. The agents 


were trained using the same hyperparameters with the only difference being the seed. 


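Table 5.1. Reproduce Results Experiment Parameters

[Table body not recoverable. Surviving fragments: a "Hidden Nodes" column header and entries 1 through 4 listed once for each of the two tests.]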

5.1.1 Karpathy’s Policy Network 
The purpose of this test is to verify proper implementation of Karpathy’s policy network
algorithm with GPU optimization. Four agents were trained for 80K episodes using seeds
one through four. The results are analyzed using reward curves, activation plots, and saliency
maps.


Hypothesis 

The running reward will start at -21, increase over time, and gradually plateau between
+10 and +21. The hidden layer activations will be stable. That is, throughout training the
activations will be distributed normally about some mean. The saliency map will depict an
interpretable strategy for the agent.


Results 

Figure 5.1 shows the running reward curves for four agents trained with different seeds.
The agents improved over 80k episodes and had a running reward of approximately +5
when training stopped. At the start, the agents were unable to score against the Atari agent.
Around 50,000 episodes, the running reward turned positive, indicating the trained agents
were mostly winning their 21-point games.


Figure 5.2 shows the first 50 hidden node activations collected during episodes 10k, 30k,
50k, and 70k. For episode 10k, the activations are representative of the Xavier initializer [27].
The initial weights were drawn from a standard normal distribution with variance
1/√n = 1/√200, where n is the number of hidden nodes in the layer. By episode 10K there
were 100 parameter updates, so the third quartile becomes approximately [value illegible in
the source]. At 30k episodes, the upper whiskers are distributed about 0.5. The trend remains
the same for episodes 50k and 70k. The outliers become larger, ranging from about 0.8 in
episode 10k to 2.0 in episode 70k. We do not see any vanishing or exploding activations.


The saliency maps in Figure 5.3 show how the agent refined its strategy over training.
At 20K episodes, we see the agent attempts to strike the ball from a paddle position at
the bottom of the screen. As the ball bounces off the bottom, the agent moves the paddle
upwards to try and strike the ball, but then retreats downwards. The ball continues moving
northeast and gets past the agent. At 40K episodes, the strategy is more refined, but still
needs improvement. As the ball travels east, the agent attempts to strike it but misses by a
few pixels. At 60K episodes, we see the agent waits at the bottom of the screen by
taking alternating up and down actions. When the ball is nearly vertical with the paddle,
the agent shoots the paddle up and hits the ball at an angle. Interestingly, the Pong agent
trained in [33] developed a similar strategy despite having been trained using a different
RL algorithm. At 80K episodes, the agent is able to strike the ball after it bounces off the
bottom. The 20K episode agent could not handle this scenario.


Examining the heatmaps, we see that the paddle positions of both the agent and opponent
have the most influence on the neural network output. Surprisingly, the ball often goes
ignored except when it is nearly vertical with the agent paddle. At the start of training, 
the agent does not know which paddle it controls and must learn to make decisions using 
both paddles. We expected the region surrounding the opponent paddle would become less 
important as training progressed, but this was not the case. Perhaps the agent is using the 


opponent paddle to anticipate where the ball will be after it is hit. 


In summary, the baseline agent with no human trainer performed very well. We did not find 
any abnormalities in the activation plots and the saliency map shows the agent developed a 


reasonable strategy. 
Top: Four agents were trained, with and without the human trainer, using different
seeds. Bottom: Side-by-side comparison of agents trained with the same seed.


Figure 5.1. Reward Curves for Karpathy's and Walton's Agents

Each boxplot shows activations for the first 50 hidden nodes collected during a
single episode.


Figure 5.2. Activation Plots for Karpathy's Agent

Saliency maps depict selected frames from a game segment. Videos are available
on GitHub.


Figure 5.3. Saliency Maps for Karpathy's Agent
5.1.2 Walton’s Human Trainer

The purpose of this test is to reproduce the behavior of Walton’s agents trained using the
defensive strategy explained in Section 3.3. Our GPU-optimized version of Walton’s algorithm
was used. Four agents were trained for 80K episodes using seeds one through four. The results
are analyzed using reward curves, activation plots, and saliency maps.


Hypothesis 

We expect to see results similar to [9]. Early in training, the agent is expected to improve
at a faster rate than the agent trained without a human trainer, but at a certain point the
running reward will plateau and begin to slowly decay back towards -21. The activation
plot is likely to show some abnormalities as compared to Karpathy’s agent. We believe
the agent will still show some interpretable strategy even late in training when the running
reward has fallen to -21. The saliency maps are likely to show that pixels surrounding the
ball were most important to the neural network. The paddles will have some effect on the
output probability, but not nearly as much as the ball.


Results 

From Figure 5.1, we see the agents showed behavior similar to [9]. For the first 30k
episodes, Walton’s agents averaged +1 to +2 points better per episode as compared to the
agents without a human trainer. Between 30k and 40k episodes, the agents’ performance
plateaued and then decayed rapidly until converging to -21 (not scoring any points).


Analyzing the activations of the hidden nodes provides some explanation for the degrading
performance. From Figure 5.4, the range of activations widens from approximately [0, 1) in
episode 10k to [0, 9) in episode 70k. Compared to the activations from test 1, the activations
for Walton’s agents are cause for concern. In episode 70k, every neuron has either vanished
or exploded. Those that exploded have a maximum around 2.0, which is four times larger
than Karpathy’s agent. The vanished activations have squashed quartile ranges at or near
zero.


The saliency maps in Figure 5.5 partially capture the behavior of the agent at different
points in the training cycle. At 20K episodes, we do not see a clear strategy forming as we
did in Karpathy’s agent. The paddle is mostly acting randomly while at times trying to keep
horizontally aligned with the ball. The heatmap shows the ball and paddles had the most
influence on the neural network, similar to Karpathy’s. At 40K episodes, the agent is playing
better, but is still missing the ball often. We see the paddle attempt to strike the ball, but it
misses. At 60K episodes, we can see the agent tried to adopt a similar strategy to Figure 5.3.
The paddle is near the bottom of the frame waiting as the ball approaches. Once the ball is
nearly vertical with the paddle, the agent shoots its paddle upwards trying to send the ball to
the upper left corner. Interestingly, this is not the strategy the human trainer was trying to
teach the agent. The trainer encourages a defensive strategy where the paddle is kept
horizontal with the ball. At 80K episodes, the agent has lost the ability to defend or score.
The paddle is locked at the top of the frame and occasionally comes down slightly. The Atari
agent is able to easily score continuously. This follows the reward curve in Figure 5.1 that
shows the running reward is at -21 by 80K episodes.


The saliency maps suggest that influence from the human trainer prevented the agent from 
learning how to execute its strategy well. Additionally, the agent failed to learn the defensive 
strategy encouraged by the human trainer. Perhaps the two strategies were in conflict and 


the agent was not able to perfect either one. 


5.2 Vary Hyperparameters 
The second experiment includes two tests. The first test trains many agents with varying 
intensity values. The second test varies the number of hidden nodes in the neural network. 


Table 5.2 lists the parameters used to train the agents for each test. 


5.2.1 Intensity 

The purpose of varying intensity is to see how the application of Walton’s heuristic affects
agents. The intensity parameter, i, is a value between 0 and 1 that determines how much
influence the teacher has over the agent. The NN output represents the probability an agent
should move up, p_up. If the teacher recommends the agent move up, the agent will move
up with probability p_up * (1 + i). If the teacher recommends a downward move, the agent
will move up with probability p_up * (1 − i). In some frames the teacher may not make a
recommendation because the paddle is already horizontal with the ball. In this case, the
agent will simply move up with probability p_up.
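
The three cases translate directly into code. In this sketch the function names and the string labels for recommendations are hypothetical, and the move is chosen by comparing the adjusted probability with a uniform sample, as described in the Move Selection section.

```python
import numpy as np


def adjusted_up_probability(p_up, recommendation, intensity):
    """Scale the network output by the teacher's recommendation."""
    if recommendation == "up":
        return p_up * (1 + intensity)
    if recommendation == "down":
        return p_up * (1 - intensity)
    return p_up   # no recommendation: use the network output unchanged


def select_move(p_up, recommendation, intensity, rng):
    """Move up when a uniform sample falls below the adjusted probability."""
    threshold = adjusted_up_probability(p_up, recommendation, intensity)
    return "up" if rng.uniform() < threshold else "down"


rng = np.random.RandomState(1)
move = select_move(0.5, "up", 0.2, rng)
```

With p_up = 0.5 and i = 0.2, an upward recommendation raises the chance of moving up to 0.6 and a downward recommendation lowers it to 0.4. When the adjusted value exceeds 1, the comparison with the uniform sample always selects the upward move.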
Each boxplot shows activations for the first 50 hidden nodes collected during a
single episode. 


Figure 5.4. Activation Plots for Walton's Agent 


Hypothesis
Low intensity values are expected to produce agents that do not have degrading performance
over time. The human trainer is likely to have little effect on the agent and it will perform
similarly to Karpathy’s agent. Higher intensity values are expected to produce agents that
do not improve. The human trainer will have too much influence on the decision making
process and will be an impediment to the learning algorithm.

Saliency maps depict selected frames from a game segment. Videos are available
on GitHub.


Figure 5.5. Saliency Maps for Walton's Agent
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Table 5.2. Vary Hyperparameters Experiment Parameters 


Hidden Nodes 







Results 

The intensity parameter significantly affects agent performance. Figure 5.6 shows higher 
intensity values cause agents to perform better during the initial stages of training, but the 
running reward begins decaying earlier in the training cycle. All of the agents show eventual 
performance decay except the agent with a 0.01 intensity setting. The overall performance 


of the 0.01 agent resembles Karpathy’s agent. 


The activation plots for all of the agents from this test are provided in Appendix A.4. The
plots for intensity values 0.20, 0.25, 0.30, 0.40, and 0.50 largely follow Figure 5.4. We see
some nodes exhibiting vanishing activations while others are exploding. The scale of
exploding activations does vary. For example, at 50K episodes the 0.20 intensity agent has
a maximum activation around 8.0 while the 0.50 intensity agent has activations greater
than 10.0. We also see exploding and vanishing activations appear earlier in training when
intensity is higher. In Figure A.8, at 30K episodes, more than half the nodes have anomalous
activations. For the 0.01 intensity agent, the activation plot is similar to Figure 5.2. At 80K


episodes, there do not appear to be any anomalous activations. For the 0.05 and 0.10
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Figure 5.6. Reward Curves for Varying Intensity 


intensity agents we do start to see the scale of activations increase as more episodes are 
played. Based on the trajectory of the curves in Figure 5.6, the 0.10 intensity agent is likely 
to see its running reward decay to -21. For the 0.01 and 0.05 intensity agents it is less clear 


if performance will continue to decay. 


Saliency maps are provided for the agent trained with 0.10 intensity. At 20K episodes, 
the strategy resembles Karpathy’s after the same amount of training. At 40K episodes, the 
agent is still performing well. In the frames shown, the ball bounces off the top and the 
agent moves the paddle to the very bottom. It appears the agent is going to miss, but then 
the paddle shoots up and is able to hit the ball. At 60K, we do not see deterioration in the 
strategy. The agent appears to be performing well, but based on the running reward we know 
the agent is becoming less effective at hitting the ball. Interestingly, we do not see the ball 
pixels having much influence on the neural network, yet the agent is still able to hit the ball. 
At 80K episodes, we do not see much change from 60K. The running rewards are within a 


few points of each other, so it would seem the agent is maintaining its strategy. 


See Appendix A.3 for links to saliency map videos for the other agents. The saliency maps 
for 0.20, 0.25, 0.30, 0.40, and 0.50 resemble those of Figure 5.7 after the agent has been 
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Saliency maps depict selected frames from a game segment. Videos are available 
on Github. 
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Figure 5.7. Saliency Maps for Walton's Agent (0.10 Intensity) 


trained to a -21 running reward. No definite strategy was developed earlier in training. The 


agent behavior was a combination of randomness with some ball hits following the strategy 
in Figure 5.7. The 0.01, 0.05, and 0.10 intensity agents at 100K episodes show a strategy
similar to Karpathy’s agent. This makes sense: because the intensity parameter is small, the
action selection is mostly influenced by the neural network alone with only a small bias.


5.2.2 Hidden Nodes 

The purpose of varying the number of hidden nodes is to generate a representative sample 
of activations from more than one neural network architecture. One overarching hypothesis 
is that the human trainer is causing some activations to vanish or explode. This test will add 


variation to the data that can be analyzed to test the hypothesis. 


Hypothesis 

All of the agents are expected to perform similarly to Walton’s agent. Analyzing the activations,
we expect that networks with fewer hidden nodes will have vanishing activations earlier
in training than those with more hidden nodes. Performance is expected to decline sooner
when there are fewer hidden nodes.


Results 

Every agent experienced performance decay between 25k and 30k episodes. Agents with 
more hidden nodes performed better early on, but they also degraded at a faster rate. For 
200 hidden nodes, the running reward peaked around 30K episodes and then declined to -21 
at 75K episodes. However, for 100 hidden nodes, there was a longer plateau in performance 
between 30K episodes and 50K episodes. Performance declined after 50K episodes and did


not reach -21 until around 100K episodes. 


When comparing the 200 hidden node agent’s activations in Figure 5.4 with Figure 5.9, we
see that the 100 hidden node agent was more resistant to the vanishing hidden nodes and
the scale of activations was smaller. For the 200 hidden node agent, we see vanishing
activations at 70K episodes and values above 8.0. The 100 hidden node agent had similar
results but not until 100K episodes. This suggests that networks with fewer hidden nodes
were more resistant to vanishing and exploding activations. The activation plots for the
remaining agents are provided in Appendix A.4.
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Figure 5.10 shows saliency maps for the agent trained with 100 hidden nodes. At 20K 
episodes, there was not much of a discernible strategy, but the agent was able to hit the ball 
occasionally. At 40K episodes, the strategy resembles Karpathy’s. The paddle waits for the 
ball and then shoots up to try and hit it. At 60K episodes, we see the same strategy as 40K, 
but the agent misses the ball more often. At 80K episodes, the agent misses the ball often, 
but still attempts to hit it. The heatmap for 80K episodes shows the neural network was not 
affected much by the ball pixels. For all episodes, the opponent and agent paddles remained 


the dominant deciding factor for action selection.
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Figure 5.8. Varying Hidden Nodes Reward Curves. Agents were trained with 
varying numbers of hidden nodes for 100k episodes. 


5.3 Modification to the Policy Network

Based on results from experiments 1 and 2, the activations for Walton’s agents were irregular
compared to those of Karpathy’s agents. Experiment 3 makes modifications to the policy.
The goal is not to find a ‘fix’ for the human trainer, but rather to explore conditions that may
provide insight into best practices for incorporating a human trainer into learning.
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Each boxplot shows activations for the first 50 hidden nodes collected during a 
single episode. 


Figure 5.9. Activation Plots for Agent with 100 Hidden Nodes 


5.3.1 Test 1: Decay Intensity 
The purpose of this test is to determine whether decaying the influence from the human 
trainer has any impact on agent performance. Recall that the intensity parameter determines 


how much influence the teacher has over the agent. Walton’s implementation used a constant 
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Saliency maps depict selected frames from a game segment. Videos are available 
on Github. 


Figure 5.10. Saliency Maps for Walton's Agent (100 Hidden Nodes) 


intensity parameter. In this test, the intensity is decayed after each batch update. The intensity 


after the batch update is the product of the intensity from the previous batch and the decay 
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rate. Table 5.3 lists the parameters for this test. 
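
The decay schedule described here is geometric: after n batch updates the intensity is the initial intensity multiplied by the decay rate raised to the power n. A minimal Python sketch follows; the generator name and its arguments are illustrative and are not taken from the thesis code.

```python
def decayed_intensities(initial_intensity, decay_rate, num_batches):
    """Yield the intensity in effect for each successive batch update."""
    intensity = initial_intensity
    for _ in range(num_batches):
        yield intensity
        # After each batch update, scale the intensity by the decay rate.
        intensity *= decay_rate
```

Under this schedule a decay rate of 0.9999 leaves roughly 37% of the initial intensity after 10,000 batch updates, whereas a decay rate of 0.999 leaves less than 0.01% after the same number of updates.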


Table 5.3. Modification to the Policy Network Experiment Parameters 


Intensity Decay 




Hypothesis 

Decaying the intensity will result in less influence from the human trainer over episodes. 
The higher the decay rate, the better overall performance. We expect Walton’s baseline agent 
to outperform for the initial 20k episodes, but the agents with decaying intensity will not 


perform better than Karpathy’s in the long run. 


Results 

The reward curves in Figure 5.11 show that most of the agents performed very well. The
0.5 intensity agent with a 0.9999 decay rate did not perform well. This is likely because, with
a decay rate this close to 1, the intensity decayed too slowly for the neural network to overcome
the influence from the human trainer. Interestingly, the 0.5 intensity agent performed better
than the rest between episodes zero and 10K. There were a few agents that performed better
than Karpathy’s baseline, at least for the first 60K episodes. However, from Table 5.4,
Karpathy’s agent was performing best at 60K episodes with a running reward of +2.33
compared to the next best agent with +2.03. These results suggest that using a human trainer
to bias action selection may not be the best way to accelerate performance.
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Reward curves for six agents trained with varied decays of the human trainer 
intensity. Karpathy's agent is also plotted to provide a baseline for comparison. 


Figure 5.11. Reward Curves Decaying Intensity 


Figure 5.12 shows saliency maps for the agent trained with 0.15 intensity and a 0.999 decay rate.
The agent performed similarly to Karpathy’s and we wanted to determine if they developed
the same strategy. At 20K episodes, there was evidence of a similar "kill shot" strategy.
However, at 40K and 60K episodes the agent was more erratic. When the ball was traveling
towards the opponent, the agent’s paddle would make an effort to keep horizontally aligned
with the ball. However, as the ball approached the agent, the paddle would either move to the
top or bottom of the screen and wait to strike the ball. This suggests that the agent adopted
a slightly different strategy than Karpathy’s and the human trainer was able to impart some
strategy to the agent, but only when the ball position was less relevant.


5.3.2 Dropout 
One possibility is that Walton’s agent is overfitting to the strategy taught by the human 
trainer. The agents are not given enough time to explore alternative strategies. Srivastava et 


al. introduced dropout as a technique to address the problem of overfitting [36]. Dropout 
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Table 5.4. 60K Episode Running Reward Decaying Trainer 


Intensity: 0.15, 0.15, 0.15, 0.50, 0.50, 0.50, 0.0 (Karpathy)





prevents individual units from having too much influence on the network by randomly 
dropping nodes from the neural network. Let x denote the vector of difference frame pixels 
input to the hidden layer, h, and y denote the vector of outputs from the hidden layer. The 


feed-forward of the neural network with dropout at the hidden layer becomes 


\begin{align*}
r_j &\sim \mathrm{Bernoulli}(p), \\
\tilde{x} &= r * x, \\
z &= W \tilde{x}, \\
y &= \mathrm{ReLU}(z),
\end{align*}

where r is a vector of independent Bernoulli random variables, * denotes the element-wise
product, and p is the probability of each variable being 1. We trained five agents using
different p values as shown in Table 5.3.
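
A minimal NumPy sketch of a dropout feed-forward of this kind is given below. It assumes the Bernoulli mask r is applied element-wise to the input x before multiplication by the weight matrix W; the function name, array shapes, and random generator are illustrative and are not taken from the thesis code.

```python
import numpy as np


def hidden_forward_with_dropout(x, W, p, rng):
    """Apply a Bernoulli(p) mask to x, then the hidden layer with ReLU.

    x:   input vector of difference-frame pixels, shape (d,)
    W:   hidden-layer weight matrix, shape (h, d)
    p:   probability that each mask entry is 1 (the unit is kept)
    rng: a numpy.random.Generator
    """
    r = rng.binomial(1, p, size=x.shape)  # r_j ~ Bernoulli(p)
    x_tilde = r * x                       # masked input
    z = W @ x_tilde                       # pre-activation
    y = np.maximum(z, 0.0)                # ReLU
    return y, r
```

Setting p to 1 keeps every unit and reduces the function to the ordinary feed-forward pass, which is a convenient sanity check.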


Hypothesis 
We do not believe dropout will help agents with Walton’s human trainer. From Figure 5.4, 
all 50 nodes produced large activations as compared to Karpathy’s baseline agent. There


were not any individual outlier nodes that the neural network became overly reliant on. 


Results 
The original agent without any dropout had the highest peak performance as shown in 
Figure 5.13. The higher the dropout rate, the lower the maximum running reward. However, 


the 0.4 and 0.5 dropout agents showed performance degradation at a slower rate. At 60K 
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Saliency maps depict selected frames from a game segment. Videos are available 


on Github. 
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Figure 5.12. Saliency Maps for Walton's Agent (0.15 Intensity, 0.999 Decay 
Rate) 


episodes all of the agents except the 0.4 and 0.5 dropout agents had close to a -21 running
reward. These results are not too surprising. The use of dropout in RL has not been widely
studied. It has been shown to be most effective in convolutional neural networks, which
we do not implement.


The saliency maps in Figure 5.14 do not provide any additional insight. The agent mostly
moved erratically, and any ball hits seemed to be from randomness. The pixels surrounding
the ball and paddles all showed strong heat signatures making it difficult to discern


[Plot: Running Reward versus Episode (0 to 60,000), with one curve per dropout setting from 0.0 to 0.5.]


Figure 5.13. Reward Curves for Agents with Dropout 


which parts of the screen had the most influence on the neural network.


5.4 Recovering Bad Agents 

The purpose of this experiment is to determine if faulty weights can be recovered. We take 
the resultant weights from Walton’s poorly performing agents and try to retrain them using a 
different policy. The goal is to understand whether the ‘bad’ weights retained any knowledge 
from training. We know Walton’s agent had about a -5 running reward at 30K episodes before
performance degraded to -21. There is potentially some strategy that can still be extracted 


from the weights. 


Remove Human Trainer 
This test retrains the faulty weights with the human trainer removed. The weighted average 
of the running gradients for RMSProp is reset to zero. We use four sets of weights from 


Walton’s results to initialize the agents. 
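
Resetting the optimizer state while keeping the trained weights might look like the following sketch. The dictionary layout, function names, and hyperparameter values are assumptions for illustration; the update shown is a generic RMSProp ascent step and is not necessarily the exact update used in this thesis.

```python
import numpy as np


def fresh_rmsprop_cache(model):
    """Return a zeroed running average of squared gradients for each weight array."""
    return {name: np.zeros_like(weights) for name, weights in model.items()}


def rmsprop_step(model, grads, cache, learning_rate=1e-3, decay=0.99, eps=1e-5):
    """Apply one in-place RMSProp update (gradient ascent on the reward)."""
    for name in model:
        g = grads[name]
        # Update the running average of squared gradients.
        cache[name] = decay * cache[name] + (1.0 - decay) * g ** 2
        # Scale the step by the root of the running average.
        model[name] += learning_rate * g / (np.sqrt(cache[name]) + eps)
```

Here `model` would hold the weights loaded from one of the poorly performing agents, while the cache starts from zero so that no gradient history from the earlier training run carries over.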
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Saliency maps depict selected frames from a game segment. Videos are available 
on Github. 
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Figure 5.14. Saliency Maps for Walton's Agent (0.5 Dropout) 


Hypothesis 
The running reward will slowly improve at the start and eventually perform just as well as 


Karpathy’s agent. 




Results 

The four agents never improved. The running reward stayed at or near -21 for 80K episodes.
Given the high variance of activations seen in Figure 5.4, the distribution of weights has likely
become too wide and training has become stuck in a local minimum. The neural
network updates are not able to recover the agent. Watching the agent play Pong, we see
that the paddle is completely stuck at the top of the screen and is not able to return any shots
even in later episodes of training.


Alternate Optimizer 

This test removes the human trainer and also uses a different optimizer, Adam [37]. The
Adam optimizer uses an exponential moving average of the gradient and, similar to RMSProp,
an exponential moving average of the squared gradient (the uncentered variance). Adam also
introduces a bias-correction term that compensates for initializing both averages at zero,
which helps prevent the optimizer from making very large updates to the weights early in training.
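
For reference, a single Adam update in its standard published form can be sketched as below. The function name and default hyperparameters are illustrative; the step is written as gradient descent, so a policy gradient implementation that maximizes reward would flip the sign or negate the gradient. This is not presented as the code used in this thesis.

```python
import numpy as np


def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return updated (w, m, v) after one Adam step at time step t >= 1."""
    m = beta1 * m + (1.0 - beta1) * g        # moving average of the gradient
    v = beta2 * v + (1.0 - beta2) * g ** 2   # moving average of the squared gradient
    m_hat = m / (1.0 - beta1 ** t)           # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)           # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

On the first step the bias-corrected ratio has magnitude close to 1, so each weight moves by approximately the learning rate regardless of the gradient's scale.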


Hypothesis 
If the policy of Walton’s agents is stuck in a local minimum, Adam has the potential to
help the weights escape and train to a better strategy. We expect the agent to see some 


performance improvement using Adam. 


Results 


The results mirror those of the previous test in this section. The agent did not show any performance improvement.


5.5 Results Summary 

In this chapter we showed that the GPU implementation of Karpathy’s algorithm to train
a Pong agent works. The baseline agent continuously improved over 80K episodes and
performed better than the Atari opponent. Our implementation was also able to reproduce
results from [9] where a human trainer was used to accelerate learning. As Figure
5.1 showed, learning with the same human trainer accelerated learning for the first 30K
episodes, but caused a performance decay after 30K episodes. The saliency maps showed
that Karpathy’s agent adopted a kill shot strategy where the agent would attempt to hit the
ball using a rapid upwards or downwards paddle movement. Walton’s agent had similar
behavior early in training, but eventually the paddle got stuck at the top of the screen.
This chapter also showed that adjusting the intensity of the human trainer significantly
impacted performance. The greater the influence from the human trainer, the faster the
agent would learn early on and the earlier in training performance would start to decline.
Only when intensity was 0.01 did we not see a performance decay up to 80K episodes.
Similarly, networks with more hidden nodes saw a faster rate of learning, but also an earlier
point of decline.


Decaying the intensity parameter was effective in preventing the performance decline seen
with Walton’s agent, but by 60K episodes Karpathy’s agent had a higher running reward than
the agents with a human trainer. The higher rate of performance early in training had no long
term benefit. This chapter also looked at dropout as a possible technique to prevent the agent
with a human trainer from overfitting to a strategy. Our results showed this method was not
effective. Lastly, we retrained an agent using Walton’s weights to see if it would recover, but it
did not. Changing the optimizer also did not affect the outcome.
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CHAPTER 6: 


Alternative Human Trainer 


Based on the results in Chapter 5, we believe there is a point where the agent can no longer
learn from a human trainer, and any heuristic still in effect will become an impediment to
learning. The saliency maps from Chapter 5 gave us insight into the low-level strategies
developed by the agent, which motivated this experiment. Our goal is to further understand the
phenomenon whereby conflicting policy and human trainer strategies result in ineffective
agents.


Karpathy’s and Walton’s agents both gravitated towards a kill shot strategy. However, at
around 30K episodes Walton’s agent started missing the ball more often. Walton’s agent
was attempting to make a kill shot, but the human trainer was forcing it to make decisions
using an alternate strategy that considered only a small subset of pixels. We had expected
that Walton’s agent would learn to play more defensively to mimic what the human trainer
was trying to teach. This was not the case and brought us to pose a new question. What if
the human trainer made recommendations that better followed the kill shot strategy being
learned?


In this chapter, we describe a new human trainer, Hee’s trainer, that will try to encourage the 
agent to make kill shots using a simple heuristic. The purpose is to validate our belief that 
when human trainers are applied there will be an inflection point in performance because 


there is no way to perfectly model the agent’s desired strategy. 


6.1 Hee’s Trainer 

Move recommendations influence the neural network in the same way as described in Section 3.3.
The difference between Walton’s trainer and Hee’s trainer is when recommendations have
an effect on action selection. Walton’s trainer follows the ball throughout the game and
continuously makes recommendations. Hee’s trainer ignores the ball whenever it is outside
of a five pixel horizontal distance from the agent’s paddle. When the ball is within five
horizontal pixels of the paddle, the recommendation follows that of Walton’s: increase the
probability of moving up if the ball is above the top pixel of the paddle, decrease the
probability of moving up if the ball is below the bottom pixel of the paddle, and otherwise do
not make a recommendation. The idea is to let the agent act independently of the trainer
throughout most of the game and only use the trainer when the ball is coming within striking
range. The recommendation should help the agent become better at striking the ball with a
kill shot early in training.
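
The heuristic for Hee’s trainer can be sketched as a small Python function. The function name, argument names, and the use of screen coordinates in which y grows downward are assumptions made for illustration; only the five pixel threshold and the up/down rule come from the description above.

```python
def hee_recommendation(ball_x, ball_y, paddle_x, paddle_top, paddle_bottom,
                       max_distance=5):
    """Return "up", "down", or None (no recommendation) for the current frame.

    Assumes screen coordinates in which y grows downward, so a ball that is
    above the paddle has a smaller y than the paddle's top pixel.
    """
    # Ignore the ball unless it is within striking range of the paddle.
    if abs(ball_x - paddle_x) > max_distance:
        return None
    if ball_y < paddle_top:
        return "up"
    if ball_y > paddle_bottom:
        return "down"
    # The ball is level with the paddle, so no recommendation is made.
    return None
```

The returned recommendation would then bias the probability of moving up by the intensity parameter in the same way as Walton’s trainer.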


6.2 Experiment 

Initially, four agents were trained using Hee’s trainer with different seeds. After examining
the results, we decided to train an additional four agents initialized with the weights from
Hee’s agent at 40K episodes, but with the recommendation ignored. The goal was to see if Hee’s
agent could continue to outperform. If so, it would confirm our hypothesis that there is
a point of inflection where human trainers are no longer effective.


Hypothesis 
We expect Hee’s agent to outperform Karpathy’s in the early stages of training, but there 
will be a point of convergence where Karpathy’s agent begins to outperform Hee’s. We 


expect agents initialized with Hee’s weights will perform similarly to Karpathy’s. 


Results 

Figure 6.1 compares the performance of Hee’s agent against Karpathy’s and Walton’s agents.
Hee’s agent performed best during the initial 40K episodes. Performance began to decay
after 40K episodes but at a slower rate than that of Walton’s. At 60K episodes, Hee’s agent
was performing nearly 8 points better than Walton’s. This suggests the heuristic for Hee’s
agent aided learning more effectively than Walton’s. There is an inflection point where
Karpathy’s agent starts to outperform Hee’s, but this occurs roughly 10K episodes after
Walton’s agent starts to decay. This supports our belief that human trainers can be an
impediment to learning.


Figure 6.1 also shows how Hee’s agent performed when the heuristic was removed at 40K
episodes, a choice we made after observing the performance decline during the original
agent’s training. Interestingly, the agent continued to get better and actually outperformed
Karpathy’s. The initial spike from -8 to -2 occurred because we restarted training from the
initial learning rate. This result suggests that human trainers can be effective at accelerating
performance, but they must be decayed or removed at some point in training. One technique
to remove the trainer would be to track the moving average of the reward and then
eliminate the trainer when the moving average starts to decline. The slope of the decline
may need to be tuned as a hyperparameter.
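
One way such an automatic removal might be implemented is sketched below. This approach was not implemented or evaluated in this thesis; the class name, the window and lag parameters, and the slope threshold are all hypothetical and would need to be tuned.

```python
from collections import deque


class TrainerSwitch:
    """Disable the human trainer once the moving-average reward declines."""

    def __init__(self, window, lag, slope_threshold):
        self.rewards = deque(maxlen=window)    # most recent episode rewards
        self.averages = deque(maxlen=lag + 1)  # most recent moving averages
        self.lag = lag
        self.slope_threshold = slope_threshold
        self.trainer_active = True

    def update(self, episode_reward):
        """Record one episode reward and return whether the trainer stays on."""
        self.rewards.append(episode_reward)
        if len(self.rewards) < self.rewards.maxlen:
            return self.trainer_active
        self.averages.append(sum(self.rewards) / len(self.rewards))
        if len(self.averages) == self.averages.maxlen:
            # Average change per episode over the last `lag` episodes.
            slope = (self.averages[-1] - self.averages[0]) / self.lag
            if slope < self.slope_threshold:
                # Once removed, the trainer is never re-enabled.
                self.trainer_active = False
        return self.trainer_active
```

In a training loop, `update` would be called once per episode, and the trainer's recommendation would be ignored as soon as it returns False.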


Figures 6.2 and 6.3 show activation plots for both of Hee’s agents. Activations
experienced high variance around 60K episodes when the trainer was still being incorporated
in learning. This follows Figure 5.4 where activations got as high as +8. In this case,
activations are still less than +4, but we expect that the variance will continue to increase
with training. When the heuristic was removed at 40K episodes, the activations stayed below
+2.5 even at 60K episodes.


In the Figure 6.4 saliency maps we had expected to see the ball show a strong heat signature
when getting within five horizontal pixels of the paddle, but this is not the case. Additionally,
at 60K episodes Hee’s agent is still playing quite well, but there is less emphasis on the ball.


Figure 6.6 compares Karpathy’s, Walton’s, and Hee’s agents using saliency maps at 60K 
episodes. Hee’s and Karpathy’s agents behaved most similarly. We see the paddle at the 
bottom of the screen as the ball approaches before attempting to strike the ball. Walton’s 
agent tries to stay horizontally aligned with the ball at first, then moves quickly to the bottom 
before coming back up to strike the ball. 


6.3 Results Analysis

The purpose of this chapter was to test the hypothesis that incorporating a human trainer
into the RL algorithm will impede learning. This experiment provides evidence that policy
gradient networks and human trainers are not compatible. One possible reason is that the
neural network makes decisions based on every pixel in the game whereas the human
trainer has a more localized approach. The saliency maps emphasize the regions around
both paddles and the ball. This suggests the agent is able to cue off of the opponent’s actions
to set itself up better for a scoring strike. The view that human advice is relevant to decision
making only holds early in training. Once the policy has had some number of steps to learn,
the human advice is no longer relevant and needs to be ignored if the agent is to continue to
improve.
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Figure 6.2. Activation Plots for Hee’s Agent 
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Figure 6.3. Activation Plots for Hee’s Agent Without Trainer 
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Saliency maps depict selected frames from a game segment. Videos are available 
on Github. 








Figure 6.4. Saliency Maps for Hee's Agent 


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Saliency maps depict selected frames from a game segment. Videos are available 


on Github. 





Figure 6.5. Saliency Maps for Hee’s Agent Without Trainer 
Agents were trained for 60K episodes, except in the case of "Hee Retrained" where
the agent was initialized with 40K episode weights from Hee’s agent and trained
for 20K additional episodes.


Figure 6.6. Saliency Map Comparison 
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CHAPTER 7: 
Conclusions and Future Work 


Recent advances in artificial intelligence have been impressive, with RL applications showing
the most promise. In game environments, such as Atari, agents are consistently outperforming
humans. Nonetheless, there is still a tremendous gap in the research community’s understanding
of why agents develop the strategies that they do and under what conditions they can
be fooled into taking harmful actions.


Incorporating human trainers into the learning algorithm is one approach to addressing 
safety concerns and accelerating learning, but our results show that human trainers may not 
be reliable. The RL algorithm has no way of reconciling conflicting strategies. The neural 
network believes the best approach to maximizing rewards is to take some action while the 
trainer may be encouraging it to take a different action. In the early stages of training, this 
is not an issue because the agent behaves mostly randomly because of the random weight 
assignments. The human trainer is able to guide the agent towards taking a slightly less 
random action. This is why we see Walton’s and Hee’s agents outperform Karpathy’s early 


on.


7.1 Conclusions 

When training a neural network to play Atari Pong, we found that the policy learned a 
strategy whereby the agent would attempt to hit the ball using a rapid upwards or downwards 
movement. Saliency maps indicated that the NN was mostly reliant on ball trajectory and 
the paddle movements of the agent and opponent. Incorporating a human trainer into the 
RL algorithm had adverse long term effects. The agent learned to play better at first, but 
past 30K episodes performance began to decline until eventually the paddle became stuck 
at the top of the frame. Analyzing activations of hidden nodes showed that the variance 
of activations exploded. Perunicic showed that vanishing and exploding activations are
an impediment to efficient learning [31]. Backpropagation requires that the variance of
activations remain close to the variance at initialization. Small activations result
in zero gradients while extreme activations diverge towards zero derivatives. Both cases
prevent the weights from updating efficiently.




Declining performance and exploding activations were seen with both Walton’s and Hee’s
agents. However, we showed that decaying or removing the human trainer before the running
reward trends negative avoids adverse behaviors. Based on the two rules examined in Pong, 
we believe there is a point of inflection whereby the agent has learned all it can from the 
human and must be allowed to act independently. If the human trainer is not removed or 
decayed by that point of inflection, the agent’s performance will suffer and the variance in 


activations will become very large. 


The methodology of this thesis relied heavily on reward curves, saliency maps, and analysis 
of activations. Each of these methods provided new insight into RL algorithms with human 
trainers. However, there are still notable limitations to these approaches. The reward curves
only show performance against a statically defined opponent and do not generalize to performance
against other opponents such as humans or adversarial agents. Saliency maps show
how individual frames affect NN probabilities. Sequential frames must be interpreted by 
a human observer in order to characterize strategy. This characterization is subjective and 
does not provide empirical evidence of a particular strategy. Analyzing activations allowed 
us to show how the variance of the gradient diverged, but a formal approach is still needed 
to explain why the divergence occurred. Each of these shortfalls in our methodology may 


be explored in future work. 


7.2 Future Work 


This thesis showed that decaying the human trainer or removing the rule after some bootstrap 
cycle was able to accelerate training in Atari Pong. One issue is that the decay rate for 
the human trainer becomes another hyperparameter that must be tuned. The removal of 
the human trainer can be automated using the running reward as an indicator of declining 
performance, but the rate of decline before removing the trainer would have to be considered 
a hyperparameter. From our Pong results, we could optimize the decay rate and rate of 
decline, but this would require that we first train the agent and then retroactively optimize. Zador
was critical of the research community’s adoption of the "whatever-works-best" approach
to advancing artificial intelligence [38]. Although our results show that hyperparameters
can be tuned to train an agent faster, it is not practical and does not guarantee the agent 
will behave in alignment with the human trainer. Thus, biasing action selection may not 


be a good approach to addressing safety concerns. There are, however, other approaches 
to shaping learning with human trainers. Knox et al. have proposed eight possible methods,
but they have not been individually studied in the level of detail we believe is necessary to 
characterize strategy and long term performance. Future work should investigate alternative 


human trainers in order to generalize the results from this thesis. 


A possible reason for performance inflection is that the policy network relies on the fact 
that most actions in the long run will be properly marked with the discounted reward. 
When there’s a human trainer, the policy network may make the best decision but then
receive a negative reward because of the heuristic. During backpropagation the network
will discourage the action when in fact it should have been encouraged. If there are a greater 
number of false markings than true then the policy cannot learn and we end up with the 


behavior seen in the results. This possibility should be investigated further. 


An important limitation of this thesis was the singular training environment. Future work 
should investigate other game environments in an attempt to reproduce the behavior seen in 
this thesis. As we showed with Hee’s agent, the trainer does not need to be overly complex. 
Thus, we believe it is reasonable to conduct the performance and saliency map experiments 


across all of the Atari games. 


The exploding activations seen in Walton’s agent should be examined. A formal investigation 
into the policy gradient algorithm with emphasis on how the biased action selection affects 


backpropagation is needed. 
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APPENDIX: Supplementary Information 


This appendix provides supplementary information, including performance statistics,
information regarding source code, and additional figures of interest.


A.1 Training Hardware and Performance

Training occurred on a personal laptop computer, an HPC cluster, and a machine learning
workstation. The personal computer was used to debug features added to Karpathy’s and
Walton’s implementations. The HPC cluster was used early in testing but was later abandoned
in favor of the machine learning workstation, which has four high-performance GPUs.
Specifications for the personal computer and machine learning workstation are provided
with performance benchmarks.


A.1.1 Hardware Specifications 


The specifications for the personal computer are as follows: 


• System: ASUS GL552VM
• Operating System: Windows 10 Pro
• CPU: Intel Core i7-6700HQ at 2.60 GHz, 4 Cores
• System Memory: 18 GB SODIMM DDR3
• GPU: NVIDIA GeForce GTX 960M, 8 GB of memory
• GPU Driver: CUDA Toolkit 9.1


The specifications for the workstation are as follows: 


• System: NVIDIA DGX Station
• Operating System: Ubuntu Desktop Linux 18.04
• CPU: Intel Xeon E5-2698 v4 2.2 GHz, 20 Cores
• System Memory: 256 GB RDIMM DDR4
• GPU: 4x Tesla V100, 32 GB of memory
• GPU Driver: CUDA Toolkit 10.1


The training algorithm is a single-core application, so the number of cores in each system
does not impact training performance unless multiple processes are launched simultaneously.
Similarly, the CuPy implementation uses a single GPU. The amount of GPU memory
used to train a single neural network is less than 8 GB, so the personal computer was able
to hold the entire network in GPU memory.
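
Because CuPy mirrors much of the NumPy interface, one code path can serve both the CPU and a single GPU. The following is a minimal sketch of that idea, not the implementation used in this thesis. The helper names `select_backend` and `policy_forward`, the 80x80 input size, and the weight initialization are assumptions made for illustration; only the 200 hidden nodes come from the configuration reported in this appendix.

```python
import numpy as np


def select_backend(prefer_gpu=True):
    """Return (array_module, name), falling back to NumPy when CuPy or a
    CUDA device is unavailable."""
    if prefer_gpu:
        try:
            import cupy as cp
            if cp.cuda.runtime.getDeviceCount() > 0:
                return cp, "cupy"
        except Exception:
            pass  # CuPy missing or no usable GPU: use the CPU
    return np, "numpy"


def policy_forward(xp, w1, w2, x):
    """Two-layer policy network: ReLU hidden layer and a sigmoid output
    giving the probability of one of two actions."""
    h = xp.maximum(w1 @ x, 0)
    logit = w2 @ h
    p = 1.0 / (1.0 + xp.exp(-logit))
    return p, h


xp, backend_name = select_backend()

rng = np.random.default_rng(1)
D, H = 80 * 80, 200  # assumed input size; 200 hidden nodes
w1 = xp.asarray(rng.standard_normal((H, D)) / np.sqrt(D))
w2 = xp.asarray(rng.standard_normal(H) / np.sqrt(H))
x = xp.asarray(rng.standard_normal(D))

p, h = policy_forward(xp, w1, w2, x)
print(backend_name, float(p))
```

With these assumed sizes the two weight matrices hold roughly 1.3 million double-precision values, about 10 MB, which is consistent with the observation that the network fits comfortably in GPU memory.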


A.1.2 Performance 

The benchmarks provided compare the performance of Karpathy’s and Walton’s implementations
to the GPU-optimized implementation used in this thesis. Given the same training
parameters, results will match across platforms. We measured the time required to train
an agent for 100 episodes with all parameters being the same. The tests were performed
sequentially on each platform. The results are provided in Table A.1.


Timing was done on Windows using the Measure-Command utility and on Linux using timeit.
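
For readers who want one measurement procedure on both platforms, the timing can also be taken inside Python. The sketch below is an alternative to the utilities named above, not the procedure used for Table A.1. `time_run` and `train_stub` are hypothetical names, and the stub only stands in for a real training loop.

```python
import time


def time_run(fn, *args, **kwargs):
    """Run fn once and return (result, wall_minutes, cpu_minutes)."""
    wall_start = time.perf_counter()  # wall-clock timer
    cpu_start = time.process_time()   # CPU time of this process only
    result = fn(*args, **kwargs)
    cpu_minutes = (time.process_time() - cpu_start) / 60.0
    wall_minutes = (time.perf_counter() - wall_start) / 60.0
    return result, wall_minutes, cpu_minutes


def train_stub(episodes):
    """Hypothetical stand-in for training an agent for `episodes` episodes."""
    total = 0
    for _ in range(episodes):
        total += sum(i * i for i in range(1000))
    return total


result, wall_m, cpu_m = time_run(train_stub, 100)
print(f"wall: {wall_m:.4f} min, cpu: {cpu_m:.4f} min")
```

Reporting wall-clock and CPU time separately matters when a GPU is involved, since time spent waiting on the device appears in the former but not necessarily in the latter.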


Table A.1. Performance Testing. Each implementation was run for 100
episodes, with a batch size of 10, 200 hidden layer nodes, intensity of 0.15,
and seed of 1. Walton’s and Hee’s implementations include the human trainer.
Hee’s implementation uses GPUs.


Hardware | Karpathy           | Walton*            | Hee**
         | CPU Ticks/Time (m) | CPU Ticks/Time (m) | CPU Ticks/Time (m)
PC       | 9.36               | 11.06              | 21.80
NVIDIA   | 4.86               | 7.90               | 5.40


Karpathy’s implementation trained the 10 batches faster than Hee’s and Walton’s, but it does
not include the human trainer. The trainer needs to scan every frame to locate the ball, so
the overhead is substantial. On the PC, which has a low-end GPU, Walton’s CPU implementation
performed better: Hee’s implementation took nearly twice as long. On the NVIDIA
DGX Station, the GPU implementation was about 32% faster than Walton’s CPU
version, and Hee’s implementation was only about 10% slower than Karpathy’s. Therefore, the
GPU implementation is a better choice than Walton’s when adequate hardware is available.
On a consumer PC, it is best to use NumPy for computation.
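
The relative figures quoted above follow directly from Table A.1. The short calculation below reproduces them, assuming "faster" is read as a reduction in elapsed time relative to the slower implementation and "slower" as an increase relative to the faster one. The function names are illustrative.

```python
# Times in minutes, copied from Table A.1.
times = {
    "PC": {"Karpathy": 9.36, "Walton": 11.06, "Hee": 21.80},
    "NVIDIA": {"Karpathy": 4.86, "Walton": 7.90, "Hee": 5.40},
}


def percent_less_time(baseline, candidate):
    """Percentage by which the candidate's time is shorter than the baseline's."""
    return 100.0 * (baseline - candidate) / baseline


def percent_more_time(baseline, candidate):
    """Percentage by which the candidate's time is longer than the baseline's."""
    return 100.0 * (candidate - baseline) / baseline


pc_ratio = times["PC"]["Hee"] / times["PC"]["Walton"]
gpu_vs_walton = percent_less_time(times["NVIDIA"]["Walton"], times["NVIDIA"]["Hee"])
hee_vs_karpathy = percent_more_time(times["NVIDIA"]["Karpathy"], times["NVIDIA"]["Hee"])

print(f"PC, Hee vs. Walton: {pc_ratio:.2f}x the time")                 # 1.97x
print(f"NVIDIA, Hee vs. Walton: {gpu_vs_walton:.1f}% less time")       # 31.6%
print(f"NVIDIA, Hee vs. Karpathy: {hee_vs_karpathy:.1f}% more time")   # 11.1%
```

The values round to the "nearly twice as long", "about 32%", and roughly 10% figures given in the text.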


A.2 Source Code Information 
All of the source code used in this thesis is made public at
https://github.com/brandonhee/rl-pong.


A.3 Saliency Map Videos 


All of the videos generated from the agents trained in this thesis are made public at
https://github.com/brandonhee/rl-pong.


A.4 Additional Results 


Provided are supplemental figures for many of the agents trained in this thesis. 
Figure A.1. Activation Plots for 0.01 Intensity
Figure A.2. Activation Plots for 0.05 Intensity
Figure A.3. Activation Plots for 0.10 Intensity
Figure A.5. Activation Plots for 0.25 Intensity
Figure A.6. Activation Plots for 0.30 Intensity
Figure A.8. Activation Plots for 0.50 Intensity
Figure A.12. Activation Plots for 180 Hidden Nodes